Mixture-of-Memories (MoM): The “Linear Attention” Breakthrough That Doesn’t Forget Long Contexts

  • MoM was released in February 2025 as a linear-attention sequence model designed to preserve long-term context without forgetting.
  • MoM breaks the single-memory bottleneck by maintaining multiple independent memory states and using a token router to assign information to memory slots.
  • A trained router assigns each token to the top-k most relevant memory modules, with typical settings like top-k=2.
  • Activated memories are updated by adding the token’s key and value outer-product to the memory, while a shared global memory is updated concurrently.
  • MoM’s computation scales linearly with sequence length, with constant cost per token during inference, unlike Transformers, whose attention scales quadratically.
  • In benchmarks, MoM outperformed prior linear models and sometimes surpassed a standard Transformer baseline of similar scale on commonsense and language-understanding benchmarks such as WikiText, LAMBADA, ARC, HellaSwag, PiQA, and WinoGrande.
  • On LongBench, MoM achieved an average score of about 15.6, higher than the 13–14 range of the next-best linear models, outperforming RetNet and Gated DeltaNet.
  • Ablation showed that multiple memories outperform a single large memory of equal total capacity, with the best results using around four memory modules in total and top-k = 2 active per token.
  • Inference speed and memory usage scale linearly with context length, and MoM used far less GPU memory and time than a Transformer with flash-attention on long inputs.
  • The authors released code as open source and it has been integrated into Hugging Face and related libraries.

Mixture-of-Memories (MoM) is a new sequence modeling architecture (released February 2025) that promises “linear attention” without the usual amnesia. In other words, MoM retains long-term information far better than previous efficient Transformer alternatives, yet keeps linear time complexity. Developed by researchers from Shanghai AI Lab and collaborators, MoM introduces multiple independent memory states guided by a token router to eliminate “memory interference” – the tendency of new inputs to overwrite old memories [1] [2]. Thanks to this design, MoM achieves exceptional recall on long sequences, outperforming prior linear models and even rivaling full Transformer accuracy on recall-intensive tasks [3]. Experts are hailing it as a breakthrough that bridges the gap between efficient models and standard Transformers, without forgetting the past.

The Memory Interference Problem in Linear Models

Traditional Transformers are powerful at modeling long-range dependencies but suffer from quadratic complexity, making long sequences computationally expensive [4]. Recent linear sequence models – such as linear attention Transformers, state-space models (SSMs), and linear RNNs – tackle this by compressing the entire input into a single fixed-size hidden state (a single “memory”) for efficiency [5]. Unfortunately, this extreme compression comes at a cost: limited memory capacity and severe memory interference. As new information streams in, it overwrites the lone memory vector, causing previous information to degrade [6]. One commentator quipped that compressing an entire sequence into one state is “like trying to cram your entire life’s memories into a single, slightly dented USB drive. Something’s gotta give.” [7] In other words, these efficient models often “forget” earlier context, leading to subpar performance on tasks requiring long-term recall.
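
To make the interference concrete, here is a minimal, illustrative sketch of an ungated linear-attention recurrence, in which every token writes its key-value outer product into the same fixed-size state (the dimensions and the absence of any normalization or gating are simplifying assumptions, not any particular model's exact formulation):

```python
import numpy as np

def single_memory_linear_attention(K, V, Q):
    """Toy linear-attention recurrence with one fixed-size memory matrix.

    K, V, Q: arrays of shape (seq_len, d). Every token adds its key/value
    outer product into the same (d x d) state S, so later writes are
    superimposed on earlier ones -- the "memory interference" described above.
    """
    d = K.shape[1]
    S = np.zeros((d, d))                 # the single compressed memory
    outputs = []
    for k_t, v_t, q_t in zip(K, V, Q):
        S = S + np.outer(k_t, v_t)       # every token writes into the same state
        outputs.append(q_t @ S)          # read-out: query against the shared state
    return np.stack(outputs)

# Usage: 8 tokens with hidden dimension 4
rng = np.random.default_rng(0)
K, V, Q = (rng.normal(size=(8, 4)) for _ in range(3))
print(single_memory_linear_attention(K, V, Q).shape)  # (8, 4)
```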

By contrast, a standard Transformer avoids this issue by storing separate key-value vectors for each token, essentially keeping independent memory for every position [8]. This grants Transformers virtually unlimited memory capacity and no interference between tokens’ representations – at the expense of much higher computation and memory usage. Prior research attempted to mitigate forgetting in linear models by adding forgetting gates or increasing the size of the single memory state [9] [10]. While such tweaks (inspired by RNN gating mechanisms) can slow down information decay, they only partially alleviate the problem [11]. Ultimately, packing all knowledge into one state remained a fundamental bottleneck. As the MoM authors summarize, “When new information overwrites the single memory, previously stored representations may degrade, negatively impacting long-term recall.” [12] The field clearly needed a new approach to retain rich long-term information without sacrificing efficiency.
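
Forget-gate variants only dampen this overwriting rather than removing it; a minimal sketch of one such gated update follows (the scalar gate alpha_t is an illustrative simplification of the vector- or matrix-valued gates used in practice):

```python
import numpy as np

def gated_single_memory_step(S, k_t, v_t, alpha_t):
    """One step of a gated single-memory update (RNN-style forget gate).

    alpha_t in (0, 1) decays the old state before the new key/value outer
    product is written in. This slows information decay, but every token
    still funnels into the one matrix S, so interference is reduced,
    not eliminated.
    """
    return alpha_t * S + np.outer(k_t, v_t)
```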

MoM’s Solution: Multiple Memories and Token Routing (No More “Amnesia”)

Mixture-of-Memories (MoM) tackles the above problem by breaking the single-memory paradigm. Inspired by neuroscience – notably how the human hippocampus uses distinct oscillatory cycles to store multiple items without interference [13] – MoM maintains multiple independent memory states in parallel, rather than one monolithic state [14]. Each memory can specialize in storing different parts or aspects of the sequence. A router network learns to dynamically assign each incoming token to one (or a few) of these memory slots, deciding which memory should store that token’s information [15]. In practice, the router computes an importance score for each token-memory pair (via a learned projection) and activates the top-$k$ most relevant memory modules for that token [16]. This means at each time step, a token updates only a subset of memory states, leaving the others untouched – preventing new inputs from overwriting all existing info.
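
A minimal sketch of such a router, assuming a learned linear projection followed by top-k selection and a softmax over the selected scores (the name W_router and the exact normalization are illustrative assumptions, not the paper's precise parameterization):

```python
import numpy as np

def route_token(x_t, W_router, top_k=2):
    """Score every memory module for token x_t and keep the top-k.

    x_t:      (d,) token representation
    W_router: (d, num_memories) learned projection (illustrative)
    Returns the indices of the chosen memories and their normalized weights.
    """
    scores = x_t @ W_router                     # one score per memory module
    top_idx = np.argsort(scores)[-top_k:]       # indices of the k highest scores
    w = np.exp(scores[top_idx] - scores[top_idx].max())
    return top_idx, w / w.sum()                 # softmax over the selected scores

# Example: idx, weights = route_token(np.ones(8), np.random.default_rng(1).normal(size=(8, 4)))
```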

Once assigned, the token’s content is written into the selected memory state(s) via a simple update rule. The authors implement this as generating a key vector and value vector from the token, then adding their outer-product to the memory matrix (akin to a rank-1 update) [17]. In essence, the memory module accumulates information from tokens routed to it, much like an independent “expert” focusing on a subset of the sequence. Crucially, memory states that are not activated by a token remain unchanged, preserving their previous content intact [18]. This sparse update strategy is the key to avoiding interference – different streams of information do not collide in the same storage. As the team explains, “each input token selectively activates and updates memory states, leaving non-activated states unchanged to avoid interference from the current input.” [19]
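
Continuing the sketch, the write step might look as follows: only the memories selected by the router receive the rank-1 outer-product update, and everything else is left untouched (the in-place accumulation with no decay term is a simplification):

```python
import numpy as np

def update_memories(memories, top_idx, k_t, v_t):
    """Write the token's key/value outer product into the routed memories only.

    memories: (num_memories, d, d) stack of independent memory matrices
    top_idx:  memory indices chosen by the router for this token
    Memories that were not activated are not touched, so their contents persist.
    """
    for m in top_idx:
        memories[m] = memories[m] + np.outer(k_t, v_t)   # rank-1 accumulation
    return memories
```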

After updating the relevant memories, MoM produces its output for that time step by aggregating across the memory states. Essentially, it computes a weighted mixture of the memory matrices (hence “mixture-of-memories”), typically weighting the activated memories by their router scores, combining them, and reading the result with the current token’s query vector [20]. In this way, the model can draw on multiple distinct memory traces to inform its predictions, rather than a single blended trace. Notably, MoM also includes a “shared” global memory that every token updates (alongside the specialized ones) [21]. This shared memory continuously accumulates general context from the whole sequence, ensuring that no information is completely lost even if it is not captured in a specific slot [22]. It acts like an always-on background memory that supports very long-term dependencies.
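
A read-out along these lines could look like the following sketch, which mixes the activated memories by their router weights, adds the shared memory, and queries the combination with the token's query vector (how the shared memory is weighted relative to the routed ones is an assumption here, building on the routing and update sketches above):

```python
def read_out(q_t, memories, shared_memory, top_idx, weights):
    """Mix the activated memories and the shared memory, then query the mixture.

    q_t:             (d,) query vector for the current token (NumPy array)
    memories:        (num_memories, d, d) specialized memory matrices
    shared_memory:   (d, d) memory that every token updates
    top_idx/weights: the router's top-k indices and their normalized weights
    """
    mixed = shared_memory.copy()
    for w, m in zip(weights, top_idx):
        mixed = mixed + w * memories[m]   # weighted mixture of distinct memory traces
    return q_t @ mixed                    # (d,) output for this time step
```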

The overall design is reminiscent of a mixture-of-experts architecture, but applied to sequence memory: each memory slot is like an expert specializing in part of the sequence, and the router directs tokens to the appropriate slots. Indeed, commentators noted that “Mixture-of-Experts have been doing something similar for ages” in splitting workloads across modules, though MoM uniquely applies the idea to internal state rather than to output layers. The MoM authors themselves highlight that this “sparsely and [virtually] unbounded expansion of memory” breaks free from the usual routine of just tweaking gates or RNN-style updates [23] [24]. By separating memory into multiple compartments, MoM dramatically expands memory capacity and eliminates interference, while still operating with linear time complexity [25] [26]. Each memory update costs a fixed amount of work per token, and since only a small, fixed number of memories is activated per token, the overall computation grows linearly with sequence length. Importantly, MoM’s inference cost per step remains constant – like other linear models, it does not grow with context length, because the model never needs to attend over all prior tokens at each step [27] [28]. In summary, MoM finds a sweet spot between Transformers’ explicit memory for every token and linear models’ aggressive compression, yielding an architecture that “significantly enhances memory capacity while retaining linear efficiency.” [29]
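
As a rough cost accounting (a sketch assuming $d \times d$ memory matrices, $k$ activated memories per token, and sequence length $T$; projection costs and constant factors are omitted):

$$ \underbrace{O(k\,d^{2})}_{\text{MoM, per token}} \;\Longrightarrow\; \underbrace{O(T\,k\,d^{2})}_{\text{MoM, full sequence}} \qquad \text{vs.} \qquad \underbrace{O(T^{2}\,d)}_{\text{softmax attention, full sequence}} $$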

Recall Performance: Outshining Linear Peers and Rivalling Transformers

The true test of MoM is whether this new architecture translates to better performance on tasks that demand long-term reasoning and recall. Experimental results so far are very promising. Across a variety of language modeling and understanding benchmarks, MoM consistently outperforms state-of-the-art linear models (like prior linear Transformers, SSMs, or gated RNN variants) and in many cases closes the gap to a full Transformer [30]. For instance, on commonsense language understanding tasks (WikiText, LAMBADA, ARC QA, HellaSwag, PiQA, WinoGrande, etc.), a MoM-based model not only beat other efficient models but even surpassed a standard Transformer baseline of similar scale [31]. This is striking – it suggests MoM’s recall abilities can match or exceed a vanilla Transformer on certain tasks, despite using linear complexity. The authors attribute this to MoM’s ability to avoid forgetting: Transformers had long been superior on “recall-intensive tasks” simply by virtue of storing more information [32] [33], but MoM narrows that advantage by remembering past tokens almost as robustly, without quadratic cost.

To specifically test long-range capability, the researchers evaluated MoM on LongBench, a benchmark suite for long-context understanding (covering summarization, few-shot learning, code completion, synthetic long-range tasks, etc.) [34]. MoM achieved the highest overall score among comparable models, outperforming recent long-context models like RetNet and Gated DeltaNet in categories like summarization and code tasks [35]. For example, MoM’s average score across LongBench tasks was about 15.6, versus 13–14 for the next-best linear models [36] – a significant leap in this regime. This demonstrates that MoM’s multi-memory mechanism isn’t just a theoretical fix; it measurably improves comprehension of very long inputs in practice. The team also ran an ablation study to confirm the benefit of multiple memories over a single large memory. They gave a baseline model the same total memory size as MoM (but as one unified state) and found it performed worse than MoM’s separated memories [37] [38]. In other words, simply having more capacity isn’t enough – it’s the separation of that capacity into distinct memory slots that reduces interference and boosts recall [39]. As the paper reports, using multiple mixed memories yielded greater improvement than an expanded single memory, “confirming that mixed memory can effectively reduce interference from different inputs” [40]. Notably, MoM with targeted memory modules plus a forget gate outperformed a model that relied only on a forget gate, underscoring that isolating memories is more effective than gating alone [41].

Despite introducing multiple memory modules, MoM retains impressive efficiency. The authors show that inference speed and memory usage scale linearly with sequence length, in contrast to a quadratic-scaling Transformer [42]. For example, when generating text with very long inputs, MoM required far less GPU memory and time than a Transformer with flash-attention optimization, especially as context length grew [43]. This means MoM can handle longer sequences on fixed hardware, making it more scalable for practical use. Training convergence is also improved – MoM models learn faster and reach lower loss than baseline models of similar size [44]. Researchers observed that throughout training on a large corpus, MoM maintained the lowest loss curve, indicating it learns more efficiently, likely because it doesn’t suffer as much from forgetting earlier data [45]. All these results point to a single conclusion: MoM effectively solves the memory interference issue that plagued linear models, delivering recall and accuracy much closer to full Transformers while preserving the speed and scalability advantages of linear complexity [46] [47].

Expert Perspectives and Future Directions

MoM’s debut has generated excitement in the AI research community, as well as a healthy dose of scrutiny. Many see it as a major step toward making long-context models more practical. A literature review praised MoM as “a significant contribution to the field of machine learning and NLP,” noting that it leverages insights from neuroscience to achieve robust long-term memory without heavy computation [48]. By tackling the fundamental memory bottleneck, MoM could “pave the way for future research into memory management and effective sequence processing techniques,” potentially influencing how next-generation large language models handle context [49]. The fact that the code has been open-sourced [50] and integrated into libraries (e.g. Hugging Face and a GitHub repo [51] [52]) means researchers are already experimenting with MoM or building on it. Indeed, MoM is compatible with various linear modeling backbones (attention, SSMs, RNNs), so it can be dropped into many architectures as a memory upgrade [53]. This flexibility suggests we may soon see hybrid models that use MoM-style memory modules in other efficient transformers or even combined with Mixture-of-Experts (MoE) for greater scalability (the authors’ group has explored a “Linear-MoE” approach alongside MoM) [54] [55].

At the same time, some experts have pointed out that MoM’s concept shares DNA with existing ideas. The use of a routing network to dispatch tokens to different states is reminiscent of MoE layers that route tokens to different expert networks – a technique proven to scale models effectively. What MoM adds is applying this idea internally to the sequence representation itself, grounded in a neuroscience analogy. There is a hint of skepticism in some quarters that MoM’s gains come essentially from adding more parameters or memory slots – as one industry observer joked, “the academic equivalent of saying ‘my algorithm uses more memory, therefore it’s better.’” [56] However, the MoM paper directly addresses this: by testing a single large memory vs multiple smaller ones, they showed it’s not just more memory but how it’s organized that matters [57]. In practice, MoM achieved its best results with only a handful of memory slots active per token (e.g. top-$k=2$ and around 4 total memory modules) [58]. Adding too many separate memories can even hurt performance slightly [59], implying there is an optimal sparse structure rather than “the more the merrier.” This insight aligns with the brain inspiration – a few distinct neural assemblies can encode items without interference, but you don’t need an extreme number of them for effective memory [60].

Looking ahead, researchers are exploring refinements to MoM and its broader applications. One avenue is to improve the router network – making the token-to-memory assignment smarter or more adaptive could further boost performance [61]. The current router uses a learned linear projection and top-$k$ selection; future versions might use more sophisticated gating or even reinforcement learning to decide how to partition information. Another direction is scaling up MoM models. The published experiments were on models up to a few hundred million parameters [62]. It will be interesting to see MoM applied to billion-parameter language models or beyond, potentially yielding large models that maintain high recall over book-length inputs. Because MoM is architecture-agnostic on the backend (it can work with any linear recurrent layer), it could also be tried in domains beyond text – e.g. long time-series forecasting or video understanding – where retaining long-term history is crucial. As one AI blog noted, “the combination of linear complexity and high performance makes MoM an attractive approach for applications where both efficiency and accuracy are critical.” [63] In short, MoM opens up new possibilities to build models that “never forget” important context, without running into the scalability wall that vanilla Transformers face.

Conclusion: Mixture-of-Memories represents a compelling innovation in sequence modeling, effectively granting “long-term memory” to efficient Transformer alternatives. By eliminating the amnesia effect of linear attention through multiple memory states and intelligent token routing, MoM closes the performance gap to Transformers on long-range tasks [64]. It stands as a proof-of-concept that we can have the best of both worlds – near-Transformer performance with linear efficiency – by rethinking how and where models store information. As the research evolves and larger MoM-based models emerge, we may witness a new generation of AI systems that handle lengthy inputs and complex reasoning with ease, all while operating at practical speeds. In the quest to build ever more capable and context-aware AI, MoM’s “no forgetting” approach could be a game-changer [65] [66].

Sources:

  1. Du, J. et al. (2025). “MoM: Linear Sequence Modeling with Mixture-of-Memories.” arXiv preprint [67] [68] [69]
  2. Du, J. et al. (2025). “MoM: Linear Sequence Modeling with Mixture-of-Memories.” Introduction & Contributions [70] [71]
  3. Moonlight AI Review (Feb 2025). “Mixture-of-Memories aimed at improving linear sequence modeling, especially on recall-intensive tasks.” [72] [73] [74]
  4. AIverse Blog (2025). “More Efficient Sequence Modeling with Mixture-of-Memories (MoM).” [75] [76]
  5. SciTech Access Commentary (Feb 2025). “Alright, let’s dissect this MoM paper” – humorous analysis [77] [78]
  6. Du, J. et al. (2025). MoM Experiments – Commonsense and Long-Context Results [79] [80] [81]
  7. Du, J. et al. (2025). MoM Figure 1 & Ablation Details [82] [83]

References

1. ar5iv.labs.arxiv.org, 2. ar5iv.labs.arxiv.org, 3. ar5iv.labs.arxiv.org, 4. ar5iv.labs.arxiv.org, 5. ar5iv.labs.arxiv.org, 6. ar5iv.labs.arxiv.org, 7. scitechaccess.com, 8. ar5iv.labs.arxiv.org, 9. ar5iv.labs.arxiv.org, 10. ar5iv.labs.arxiv.org, 11. ar5iv.labs.arxiv.org, 12. ar5iv.labs.arxiv.org, 13. ar5iv.labs.arxiv.org, 14. ar5iv.labs.arxiv.org, 15. www.themoonlight.io, 16. www.themoonlight.io, 17. www.themoonlight.io, 18. ar5iv.labs.arxiv.org, 19. ar5iv.labs.arxiv.org, 20. www.themoonlight.io, 21. ar5iv.labs.arxiv.org, 22. www.themoonlight.io, 23. ar5iv.labs.arxiv.org, 24. ar5iv.labs.arxiv.org, 25. ar5iv.labs.arxiv.org, 26. ar5iv.labs.arxiv.org, 27. arxiv.org, 28. ar5iv.labs.arxiv.org, 29. ar5iv.labs.arxiv.org, 30. ar5iv.labs.arxiv.org, 31. ar5iv.labs.arxiv.org, 32. scitechaccess.com, 33. ar5iv.labs.arxiv.org, 34. ar5iv.labs.arxiv.org, 35. ar5iv.labs.arxiv.org, 36. ar5iv.labs.arxiv.org, 37. ar5iv.labs.arxiv.org, 38. ar5iv.labs.arxiv.org, 39. ar5iv.labs.arxiv.org, 40. ar5iv.labs.arxiv.org, 41. ar5iv.labs.arxiv.org, 42. ar5iv.labs.arxiv.org, 43. ar5iv.labs.arxiv.org, 44. ar5iv.labs.arxiv.org, 45. ar5iv.labs.arxiv.org, 46. ar5iv.labs.arxiv.org, 47. ar5iv.labs.arxiv.org, 48. www.themoonlight.io, 49. www.themoonlight.io, 50. arxiv.org, 51. huggingface.co, 52. github.com, 53. github.com, 54. www.getaiverse.com, 55. www.alphaxiv.org, 56. scitechaccess.com, 57. ar5iv.labs.arxiv.org, 58. ar5iv.labs.arxiv.org, 59. ar5iv.labs.arxiv.org, 60. ar5iv.labs.arxiv.org, 61. www.getaiverse.com, 62. ar5iv.labs.arxiv.org, 63. www.getaiverse.com, 64. ar5iv.labs.arxiv.org, 65. www.themoonlight.io, 66. www.getaiverse.com, 67. ar5iv.labs.arxiv.org, 68. ar5iv.labs.arxiv.org, 69. ar5iv.labs.arxiv.org, 70. ar5iv.labs.arxiv.org, 71. ar5iv.labs.arxiv.org, 72. www.themoonlight.io, 73. www.themoonlight.io, 74. www.themoonlight.io, 75. www.getaiverse.com, 76. www.getaiverse.com, 77. scitechaccess.com, 78. scitechaccess.com, 79. ar5iv.labs.arxiv.org, 80. ar5iv.labs.arxiv.org, 81. ar5iv.labs.arxiv.org, 82. ar5iv.labs.arxiv.org, 83. ar5iv.labs.arxiv.org
