1. Introduction
Over the past two years, we have witnessed a race for sequence length in AI language models. We gradually evolved from 4k-token context windows to 32k, then 128k, and on to the massive 1-million-token window first promised by models like Gemini 1.5 Pro. The promise was alluring: dump entire codebases or novels into the model and let it reason across the entire thing.
But there is a hidden cost to this virtually “infinite” context length, one that is rarely mentioned: memory.
In a standard Transformer architecture, memorising and reasoning across the entire prompt isn’t free. As the input sequence grows, the model must store the Key and Value (KV) states for every single token to calculate attention scores. For a 1-million-token sequence, this KV cache can quickly snowball to hundreds of gigabytes, which in turn requires large clusters of GPUs across multiple data centres, just to hold the conversation in memory.
2. The Motivation
In a standard attention mechanism (Vaswani et al., 2017), every new token that the model generates needs to “look back” at every previous token in the prompt to fully understand the context. To make this efficient across generation steps, the model caches the Key (K) and Value (V) vectors of previous tokens in GPU VRAM. This is known as the KV cache.
The Linear Growth Trap
While caching the Key and Value vectors (KV cache) can be time-efficient (as we don’t have to recompute the past for every new token), it has a huge memory footprint, which grows linearly with the input sequence length.
To put this into perspective: storing the KV cache for a standard 500B-parameter model over a context of just 20,000 tokens requires about 126 GB of memory. Scale that to modern LLMs with 1T+ parameters, serving millions of users at any given time, and the total memory footprint becomes astronomically large.
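To make the arithmetic concrete, here is a back-of-envelope KV-cache calculator, a minimal sketch. The model shape used below (105 layers, 96 KV heads of dimension 128, fp16) is purely illustrative rather than the configuration of any particular model, but it lands in the same ballpark as the ~126 GB figure above.

```python
# Back-of-envelope KV-cache size. The model shape below is illustrative,
# not the actual configuration of any specific 500B-parameter model.

def kv_cache_bytes(seq_len, n_layers, n_kv_heads, head_dim, bytes_per_elem=2):
    """Bytes needed to cache K and V for a single sequence."""
    # Two tensors (K and V), each of shape [n_layers, n_kv_heads, seq_len, head_dim]
    return 2 * n_layers * n_kv_heads * seq_len * head_dim * bytes_per_elem

gb = kv_cache_bytes(seq_len=20_000, n_layers=105, n_kv_heads=96, head_dim=128) / 1e9
print(f"{gb:.0f} GB per sequence")  # ~103 GB, the same order of magnitude as the figure above
```

Note how every term in the product is fixed by the model architecture except `seq_len`, which is exactly why the footprint grows linearly with context length.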
Historically, we’ve had two ways to handle sequential data, neither of which is perfect:
- RNNs: Recurrent Neural Networks process the input prompt token by token, updating a single, fixed-size hidden state. While this greatly reduces memory requirements, they struggle to retain information and details over long prompts, so the model eventually forgets the beginning of the input sequence by the time it reaches the end.
- Transformers: Transformers, unlike RNNs, don’t suffer from this problem, as they remember everything perfectly by keeping the entire history of the conversation in the KV cache. They have perfect recall, but because of the large KV cache, they are memory-intensive.
This is the gap that Infini-attention aims to bridge.
3. The Solution: Infini-attention
To resolve this trade-off, researchers at Google proposed Infini-attention (Munkhdalai et al., 2024). The core principle of the approach is that instead of storing the entire conversation, the model can store a compressed summary of it.
Infini-attention splits the attention output into two distinct mechanisms, which work simultaneously:
- Local Attention: Same as a standard Transformer. It sees the immediate context and calculates an attention matrix for every token to capture details in high resolution.
- Global Linear Attention: A compressive memory that stores a summary of the entire past history in a fixed-size matrix, for the model to refer to.
Let’s walk through the pipeline of how this processes a long input.

Visualisation of how Infini-attention works (retrieval)
Step 1: Segmentation
First, the entire input sequence is divided into smaller segments (say, N = 2,048 tokens). Within each segment, the model uses standard dot-product attention to understand the context. This ensures that resolution stays perfect for the immediate context.
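A minimal sketch of what this segmentation looks like in code. The 2,048-token segment length follows the example above; the rest (the dummy token array, the generator-based chunking) is illustrative.

```python
import numpy as np

def segment(tokens: np.ndarray, segment_len: int = 2048):
    """Yield consecutive chunks of at most `segment_len` tokens."""
    for start in range(0, len(tokens), segment_len):
        yield tokens[start:start + segment_len]

tokens = np.arange(1_000_000)           # stand-in for a 1M-token prompt
segments = list(segment(tokens))
print(len(segments), len(segments[0]))  # 489 segments, the first holding 2,048 tokens
```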
Step 2: The Compression (Memory Update)
Before moving on to the next segment, the model compresses the Key (K) and Value (V) states of the current segment into a fixed-size Memory Matrix (M). This allows the model to later query the Memory Matrix (instead of an ever-growing KV cache) to fetch information about previous segments.
However, blindly adding new data to the Memory Matrix would quickly corrupt the information it already holds. To prevent this, the authors use the Delta Rule (Schlag et al., 2021). The intuition behind it is: before adding any new information, check whether the memory already stores it, and skip the redundant part. The entire update process is explained below:
A. The “Peek” (Calculating V_retrieved)
First, the model retrieves values from the existing memory, using the current Keys (K) as if they were queries. This lets it gauge what kind of information (values) the memory already associates with those keys:

V_retrieved = σ(K) · M_old / (σ(K) · z)

K: Keys generated for the current segment
M_old: Global memory’s current state
σ: Non-linear activation function (ELU + 1)
z: Normalising factor
V_retrieved: Value matrix retrieved from the global memory
B. The Update Step
The model then compares the actual new values (V) with the retrieved values (V_retrieved). It calculates the difference (the residual) and adds only that residual to the memory. This avoids updating the memory with what it already knows:

M_new = M_old + σ(K)ᵀ · (V − V_retrieved)

M_new: Updated global memory
σ(K)ᵀ: Transposed (activated) Key matrix of the current segment
V: Value matrix of the current segment
V_retrieved: Value matrix retrieved from the global memory in the peek step
This implies that if the memory already contains the information of the current segment perfectly, the update is zero. This keeps the memory stable and “clean” over numerous updates.
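Putting the peek and the update together, here is a minimal numpy sketch of the delta-rule memory update. The function names, toy shapes and the explicit ELU + 1 helper are my own framing of the equations above, not the paper's actual (per-head, per-layer) implementation.

```python
import numpy as np

def elu_plus_one(x):
    # sigma(x) = ELU(x) + 1, the non-linearity applied before reading/writing the memory
    return np.where(x > 0, x + 1.0, np.exp(np.minimum(x, 0.0)))

def update_memory(M_old, z_old, K, V):
    """Delta-rule update for one segment.
    K: [seg_len, d_key], V: [seg_len, d_value],
    M_old: [d_key, d_value] compressive memory, z_old: [d_key] normaliser.
    In this sketch z_old should start strictly positive (e.g. a small epsilon
    vector) to avoid division by zero on the very first segment."""
    sK = elu_plus_one(K)                                 # sigma(K)
    # A. The "peek": what the memory already associates with these keys
    V_retrieved = (sK @ M_old) / (sK @ z_old)[:, None]   # [seg_len, d_value]
    # B. The update: write only the residual (V - V_retrieved)
    M_new = M_old + sK.T @ (V - V_retrieved)
    z_new = z_old + sK.sum(axis=0)
    return M_new, z_new
```

If the memory already predicts V perfectly, the residual is zero and the write is a no-op, which is exactly the stability property described above.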
Step 3: Global Retrieval (Linear Attention)
To generate the next token, the model needs contextual information from the entire prompt, i.e., from across all segments. To get the relevant information, the model queries the Memory Matrix with its current Queries (Q) via a simple matrix multiplication:

A_mem = σ(Q) · M / (σ(Q) · z)

A_mem: Attention output retrieved from the global memory
Q: Query matrix of the current segment
M: Global memory matrix
z: Normalising factor

The resulting A_mem matrix contains the relevant information from all previous segments to generate the next token.
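A matching sketch of the retrieval step, using the same toy shapes (and the same ELU + 1 non-linearity) as the update sketch above:

```python
import numpy as np

def elu_plus_one(x):
    # sigma(x) = ELU(x) + 1, same non-linearity as in the update sketch
    return np.where(x > 0, x + 1.0, np.exp(np.minimum(x, 0.0)))

def retrieve_from_memory(M, z, Q):
    """Read the compressive memory with the current segment's queries.
    Q: [seg_len, d_key], M: [d_key, d_value], z: [d_key].
    Returns A_mem: [seg_len, d_value]."""
    sQ = elu_plus_one(Q)                  # sigma(Q)
    return (sQ @ M) / (sQ @ z)[:, None]   # normalised linear-attention read
```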
Step 4: The Aggregation (The “Mixer”)
Finally, the model has two outputs:
- A_dot: The detailed, local context from the current segment.
- A_mem: The compressed, global history of all previous segments, read from the memory matrix.
To combine the two, it uses a learned gating scalar, β (beta):

A = sigmoid(β) · A_mem + (1 − sigmoid(β)) · A_dot

sigmoid: Non-linear activation that bounds the gate between 0 and 1
A_mem and A_dot: Attention outputs from the global memory and the local dot-product attention, respectively
β: Learnt gating parameter that controls the influence of A_mem and A_dot on the final output
The β parameter acts as a mixing coefficient that determines the trade-off between long-term (A_mem) and short-term (A_dot) information flows (a minimal sketch in code follows the list below):
- When β is low: sigmoid(β) approaches 0, so the complementary weighting factor (1 − sigmoid(β)) becomes dominant and the model prioritises the local dot-product attention (A_dot) over the global compressive memory.
- When β is high: sigmoid(β) approaches 1, so the model prioritises the retrieved memory content (A_mem), allowing global context to override local information from the current segment.
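A minimal sketch of this gating step, using a single scalar β for simplicity (the paper learns one gate per attention head; the example values of β are mine):

```python
import numpy as np

def combine(A_dot, A_mem, beta):
    """Mix local (A_dot) and global (A_mem) attention outputs with a learnt gate."""
    g = 1.0 / (1.0 + np.exp(-beta))       # sigmoid(beta), bounded in (0, 1)
    return g * A_mem + (1.0 - g) * A_dot  # high beta favours the global memory

# Toy check: beta = -4 heavily favours local attention, beta = +4 the memory.
A_dot, A_mem = np.ones((4, 8)), np.zeros((4, 8))
print(combine(A_dot, A_mem, beta=-4.0)[0, 0])  # ~0.982 -> mostly local
print(combine(A_dot, A_mem, beta=+4.0)[0, 0])  # ~0.018 -> mostly global memory
```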
4. The Results: Why Infini-attention Matters
The authors put Infini-attention to the test against existing long-context models such as Transformer-XL (Dai et al., 2019) and Memorizing Transformers (Wu et al., 2022). Here are the key results:
1. The “114x” Memory Compression
The most impactful achievement of this paper is the massive reduction in memory used. Because Infini-attention stores the entire historical context in a fixed-size Memory Matrix instead of a linearly growing KV cache, it stores up to 114x fewer memory parameters in GPU VRAM than Memorizing Transformers. As shown in the table below, at a context length of 65k tokens, Infini-attention achieves SOTA perplexity on benchmarks like PG19 and Arxiv-math while needing to store only 1.6M parameters (the size of its Memory Matrix), far fewer than competing architectures.

Infini-attention notably reduces memory footprint while achieving SOTA perplexity on PG19 and Arxiv-math benchmarks
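As a rough sanity check on the 1.6M figure, the sketch below computes the size of the compressive memory for a small Transformer configuration (12 layers, 8 heads, key/value dimension 128), treating both that configuration and the per-head layout (one d_key × d_value matrix plus a d_key-dimensional normaliser) as assumptions about the paper's setup.

```python
# Rough size of the compressive memory. The layer/head/dimension counts are
# assumptions about the paper's small language-modelling setup, not quoted values.
n_layers, n_heads, d_key, d_value = 12, 8, 128, 128
per_head = d_key * d_value + d_key       # memory matrix M plus normaliser z
total = n_layers * n_heads * per_head
print(f"{total:,} memory parameters")    # 1,585,152 ~= 1.6M, independent of context length
```

The key point is in the last comment: the memory size depends only on the model's dimensions, never on how many tokens have been processed.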
2. The 1 Million Token “Passkey” Test
For a long-context architecture, the needle-in-a-haystack challenge is the conventional test. The authors ran it by hiding a random passkey in a massive corpus of text and asking the model to retrieve it. As shown in the table below, in a zero-shot setting the model struggles to find the key.
The authors then fine-tuned the model for 400 steps on sequences only 5,000 tokens long. Remarkably, the model generalised from this fine-tuning to sequences up to 1 million tokens long, with drastically improved retrieval accuracy across the board.

The three scores per entry denote the accuracy of retrieval relative to the position of the passkey hidden in the corpus (start/middle/end).
3. State-of-the-Art Book Summarization (500k Context)
Apart from synthetic tests, the authors also evaluated the model on the BookSum benchmark (Kryściński et al., 2021), where the model is required to generate a summary of a long novel. The 8B-parameter Infini-attention model set a new state of the art on the benchmark, generating successful summaries of books up to 500,000 tokens long.
The results also show a clear trend: the model’s summarisation ability improves as longer contexts are fed into it. The graph below supports this, showing that instead of forgetting previous information (a common failure mode known as “lost in the middle”), the model can effectively use the Memory Matrix to generate accurate summaries.

ROUGE vs. input length. ROUGE measures how close an AI-generated summary is to a human-written summary based on lexical similarity.
4. Visualising the Gating Scalar
As an additional analysis, the authors visualised the learnt gating scalar (β) to see how the model uses its new memory. The heatmap below shows the result: the attention heads split into two distinct roles:
- Specialised Heads: Heads that have a score near 1 or 0, indicating that they choose to focus either on local context (within segment) or global history (previous segments).
- Mixer Heads: Heads that have scores near 0.5, indicating that their main role is to merge information from both pathways efficiently.
This suggests that the model can learn to switch between short-term/long-term recall and mix information across the entire sequence.

Visualisation of β reveals that attention heads tend to specialise in either global or local attention under the Infini-attention architecture.
5. Conclusion
While Infini-attention may not fully replace external vector databases and RAG systems for reasoning over static knowledge, it does change how models process standard user queries. Integrating such architectures could be the next step toward unlocking research directions that were previously bottlenecked by hardware constraints, ultimately accelerating progress in the field of language modelling.
👉If you liked this piece, I share shorter up-to-date writeups on Substack.
👉And if you want to support independent research writing, BuyMeACoffee helps keep it going.
6. References
- Infini-attention (Main Paper): Munkhdalai, T., Faruqui, M., & Gopal, S. (2024). Leave No Context Behind: Efficient Infinite Context Transformers with Infini-attention. arXiv preprint arXiv:2404.07143.
- Transformer-XL: Dai, Z., Yang, Z., Yang, Y., Carbonell, J., Le, Q. V., & Salakhutdinov, R. (2019). Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context. arXiv preprint arXiv:1901.02860.
- Memorizing Transformers: Wu, Y., Rabe, M. N., Hutchins, D., & Szegedy, C. (2022). Memorizing Transformers. arXiv preprint arXiv:2203.08913.
- Linear Attention (The math foundation): Katharopoulos, A., Vyas, A., Pappas, N., & Fleuret, F. (2020). Transformers are RNNs: Fast Autoregressive Transformers with Linear Attention. International Conference on Machine Learning.
- BookSum Benchmark: Kryściński, W., Rajani, N., Agarwal, D., Xiong, C., & Radev, D. (2021). BookSum: A Collection of Datasets for Long-form Narrative Summarization. arXiv preprint arXiv:2105.08209.
- Standard Attention: Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., & Polosukhin, I. (2017). Attention Is All You Need. Advances in Neural Information Processing Systems, 30.
- Delta Rule: Schlag, I., Irie, K., & Schmidhuber, J. (2021). Linear Transformers Are Secretly Fast Weight Programmers. International Conference on Machine Learning, PMLR.