How LLMs Handle Infinite Context With Finite Memory

By AiNEWS2025 · 2026-01-09 · Machine Learning

1. Introduction

Over the past two years, we have witnessed a race for sequence length in AI language models. We gradually evolved from a 4k context length to 32k, then 128k, and on to the massive 1-million-token window first promised by models like Gemini 1.5 Pro. The promise was alluring: dump entire codebases or novels into the model and let it reason across the entire thing.

But there is a hidden cost to this virtually “infinite” context length that is rarely mentioned: memory.

In a standard Transformer architecture, memorising and reasoning across the entire prompt isn’t free. As the input sequence grows, the model must store the Key and Value (KV) states for every single token to calculate attention scores. For a 1-million-token sequence, this KV cache can quickly snowball to hundreds of gigabytes, which in turn requires large clusters of GPUs across multiple data centres just to hold the conversation in memory.

2. The Motivation

In a standard attention mechanism (Vaswani et al., 2017) [6], every new token that the model generates needs to “look back” at every previous token in the prompt to fully understand the context. To make this efficient across successive generation steps, the model caches the Key (K) and Value (V) vectors of previous tokens in GPU VRAM. This is known as the KV cache.

The Linear Growth Trap

While caching the Key and Value vectors (KV cache) can be time-efficient (as we don’t have to recompute the past for every new token), it has a huge memory footprint, which grows linearly with the input sequence length.

To put this into perspective: storing the KV cache of a standard 500B-parameter model for a context of just 20,000 tokens requires about 126GB of memory. If we scale that to the parameter counts of modern LLMs (1T+ parameters), and to serving millions of users at any given time, the total memory footprint becomes astronomical.
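
To see where a figure like 126GB comes from, here is a minimal back-of-the-envelope sketch. The layer count, hidden size, and 16-bit precision below are illustrative assumptions rather than the published configuration of any particular 500B model; the point is only that the cache grows linearly with sequence length.

```python
# Back-of-the-envelope KV-cache size for a dense decoder-only Transformer.
# The hyperparameters below are illustrative assumptions, not the published
# specs of any particular 500B-parameter model.
def kv_cache_bytes(num_layers: int, hidden_dim: int, seq_len: int,
                   bytes_per_value: int = 2) -> int:
    # 2x because both a Key and a Value vector are cached per token, per layer.
    return 2 * num_layers * hidden_dim * seq_len * bytes_per_value

size = kv_cache_bytes(num_layers=96, hidden_dim=16384, seq_len=20_000)
print(f"KV cache: ~{size / 1e9:.0f} GB")  # ~126 GB in fp16, growing linearly with seq_len
```

Doubling the context length doubles this figure, while the model weights stay the same size.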

Historically, we’ve had two ways to handle sequential data, neither of which is perfect:

  1. RNNs: Recurrent Neural Networks process the input prompt token by token, updating a single, fixed-size hidden state. While this greatly reduces memory requirements, RNNs struggle to retain information and details over extended prompts, so the models eventually forget the beginning of the input sequence by the time they reach the end.
  2. Transformers: Transformers, unlike RNNs, don’t suffer from this problem: they keep the entire history of the conversation in the KV cache and therefore have perfect recall, but that ever-growing cache makes them memory-intensive.

This is the gap that Infini-attention aims to bridge.

3. The Solution: Infini-attention

To solve this memory paradox, researchers at Google proposed Infini-attention (Munkhdalai et al., 2024) [1]. The core principle of the approach is that instead of storing the entire conversation, we can store a summary of it.

Infini-attention splits the attention output into two distinct mechanisms, which work simultaneously:

  1. Local Attention: Same as a standard Transformer. It sees the immediate context and calculates an attention matrix for every token to capture details in high resolution.
  2. Global Linear Attention: A compressive memory that stores a summary of the entire past history in a fixed-size matrix, for the model to refer to.

Let’s walk through the pipeline of how this processes a long input.

[Figure: Visualisation of how Infini-attention works (retrieval). Source: Author]

Step 1: Segmentation

Firstly, the entire input sequence is divided into smaller segments (say, N=2,048 tokens). Within each segment, the model uses the standard Dot-Product Attention to understand the context. This ensures that for immediate tasks, resolution remains perfect.
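
As a rough illustration of this step, here is a minimal NumPy sketch of segmentation plus per-segment dot-product attention. It ignores batching, multiple heads, and causal masking, and the function names (`local_attention`, `split_into_segments`) are mine, not from the paper's code.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def local_attention(Q, K, V):
    # Standard scaled dot-product attention, restricted to one segment.
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)   # (N, N) attention matrix for this segment
    return softmax(scores) @ V        # (N, d_v) local output, A_dot

def split_into_segments(X, segment_len=2048):
    # Chop a long token-level array of shape (T, d) into fixed-size segments.
    return [X[i:i + segment_len] for i in range(0, len(X), segment_len)]
```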

Step 2: The Compression (Memory Update)

Before moving on to the next segment, the model compresses the current segment’s Key (K) and Value (V) states into a fixed-size Memory Matrix (M). This allows the model to query the Memory Matrix (instead of the larger, ever-growing KV cache) to fetch information about previous segments.

However, blindly adding new data to the Memory Matrix can quickly corrupt the information it already holds. To prevent this, the authors use the Delta Rule (Schlag et al., 2021) [7]. The intuition is: before adding any new information, check whether the memory already stores it. This avoids redundant updates. The entire update process is explained below:

A. The “Peek” (Calculating V_retrieved)

Firstly, the model retrieves values from the existing memory using the current Keys (K) as if they were queries. The model does this to gauge what kind of information (values) the memory already associates with current keys.

V_retrieved = (σ(K) · M_old) / (σ(K) · z)

K: Keys generated for the current segment
M_old: the global memory’s current state
σ: non-linear activation function (ELU + 1)
z: normalising factor
V_retrieved: values retrieved from the global memory

B. The Update Step

The model then compares the actual new values (V) with the retrieved values (V_retrieved). It calculates the difference (the residual) and only adds that to the memory. This avoids updating the memory with what it already knows.

M_new = M_old + σ(K)ᵀ (V − V_retrieved)

M_new: updated global memory
σ(K)ᵀ: transposed (activated) Key matrix of the current segment
V: Value matrix of the current segment
V_retrieved: values retrieved from the global memory in the previous step

This implies that if the memory already contains the information of the current segment perfectly, the update is zero. This keeps the memory stable and “clean” over numerous updates.
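
Continuing the NumPy sketch, the peek and the delta-rule update might look roughly like this. Shapes are assumed as K of shape (N, d_k) and V of shape (N, d_v) for a segment of N tokens, with memory M of shape (d_k, d_v) and normaliser z of shape (d_k, 1); this is a simplified single-head reading of the paper's equations, not the authors' implementation.

```python
def elu_plus_one(x):
    # σ(x) = ELU(x) + 1, which keeps all entries positive.
    return np.where(x > 0, x + 1.0, np.exp(x))

def memory_update(M_old, z_old, K, V):
    """Delta-rule write of one segment's (K, V) into the compressive memory."""
    sigma_K = elu_plus_one(K)                             # (N, d_k)
    # A. The "peek": what the memory already associates with these keys.
    V_retrieved = (sigma_K @ M_old) / (sigma_K @ z_old)   # (N, d_v)
    # B. Write back only the residual, so known information is not re-added.
    M_new = M_old + sigma_K.T @ (V - V_retrieved)         # (d_k, d_v)
    z_new = z_old + sigma_K.sum(axis=0, keepdims=True).T  # (d_k, 1) running normaliser
    return M_new, z_new
```

If the memory already predicts V perfectly, the residual is zero and M is left untouched, which is exactly the stability property described above.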

Step 3: Global Retrieval (Linear Attention)

To generate the next token, the model needs contextual information from the entire prompt, i.e., from all segments. To fetch the relevant information, the model queries the Memory Matrix with a single matrix multiplication.

A_mem = (σ(Q) · M) / (σ(Q) · z)

A_mem: attention output retrieved from the global memory
Q: Query matrix of the current segment
M: global memory matrix
σ: non-linear activation function (ELU + 1)
z: normalising factor

The resulting A_mem matrix contains the relevant information from all previous segments needed to generate the next token.
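
A corresponding read from the memory, reusing the elu_plus_one helper from the update sketch above, could be as simple as:

```python
def memory_retrieval(M, z, Q):
    """Linear-attention read of the global memory for the current segment's queries."""
    sigma_Q = elu_plus_one(Q)             # (N, d_k)
    return (sigma_Q @ M) / (sigma_Q @ z)  # (N, d_v), the A_mem term
```

Note that the cost of this read is independent of how many segments have already been folded into M, which is the whole point of the fixed-size memory.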

Step 4: The Aggregation (The “Mixer”)

Finally, the model has two outputs:

  1. A_dot: The detailed, local context from the current segment.
  2. A_mem: The compressed, global history of all previous segments, retrieved from the memory matrix.

To combine the two, it uses a learned gating scalar, β (beta):

A = sigmoid(β) · A_mem + (1 − sigmoid(β)) · A_dot

sigmoid: non-linear activation that bounds the gate between 0 and 1
A_mem and A_dot: attention outputs from the global memory and local dot-product attention, respectively
β: learnt gating parameter that controls the influence of A_mem and A_dot on the final output

The β parameter acts as a mixing coefficient that determines the trade-off between long-term (A_mem) and short-term (A_dot) information flows, as sketched in the code below:

  • When β is low: sigmoid(β) approaches 0, so the complementary weight (1 − sigmoid(β)) dominates and the model prioritises the local dot-product attention (A_dot) over the global compressive memory.
  • When β is high: sigmoid(β) approaches 1, so the model prioritises the retrieved memory content (A_mem), allowing global context to override local information from the current segment.
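
Tying the earlier sketches together, the gate and a segment-by-segment driver loop might look as follows. A single scalar β is used here for simplicity, whereas the paper learns one gate per attention head; as before, this is an illustrative sketch rather than the reference implementation.

```python
def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def aggregate(A_mem, A_dot, beta):
    # Learnt gate: high beta favours global memory, low beta favours local attention.
    g = sigmoid(beta)
    return g * A_mem + (1.0 - g) * A_dot

def infini_attention_layer(Q_all, K_all, V_all, beta, segment_len=2048):
    d_k, d_v = K_all.shape[-1], V_all.shape[-1]
    M = np.zeros((d_k, d_v))         # fixed-size compressive memory
    z = np.full((d_k, 1), 1e-6)      # small init to avoid division by zero
    outputs = []
    for Q, K, V in zip(split_into_segments(Q_all, segment_len),
                       split_into_segments(K_all, segment_len),
                       split_into_segments(V_all, segment_len)):
        A_mem = memory_retrieval(M, z, Q)   # global, compressed context
        A_dot = local_attention(Q, K, V)    # local, high-resolution context
        outputs.append(aggregate(A_mem, A_dot, beta))
        M, z = memory_update(M, z, K, V)    # then fold this segment into memory
    return np.concatenate(outputs, axis=0)
```

The memory is read with the state from before the current segment’s update, so each segment attends to the compressed history of everything that came before it.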

4. The Results: Why Infini-attention Matters

The authors put Infini-attention to the test against existing long-context models, such as Transformer-XL (Dai et al., 2019) [2] and Memorizing Transformers (Wu et al., 2022) [3]. The results are as follows:

1. The “114x” Memory Compression

The most impactful achievement of this paper is the massive reduction in memory use. Because Infini-attention stores the entire historical context in a fixed-size Memory Matrix instead of a linearly growing KV cache, it stores 114x fewer parameters in GPU VRAM than Memorizing Transformers. As shown in the table below, for a context length of 65k tokens, Infini-attention achieves SOTA perplexity on benchmarks like PG19 and Arxiv-math while needing to store only 1.6M memory parameters (the size of the Memory Matrix), far fewer than competing architectures.

[Table: Infini-attention notably reduces memory footprint while achieving SOTA perplexity on the PG19 and Arxiv-math benchmarks. Adapted from Munkhdalai et al., Table 2]

2. The 1 Million Token “Passkey” Test

The needle-in-a-haystack challenge is the conventional test for a long-context architecture. The authors ran it by hiding a random passkey in a massive corpus of text and asking the model to retrieve it. As shown in the table below, in a zero-shot setting the model struggles to find the key, achieving mostly low retrieval accuracy.

The authors then fine-tuned the model for 400 steps with sequences that had a length of only 5,000 tokens. Remarkably, the model was able to generalise the fine-tuning to work with sequences up to 1 million tokens long, with drastically improved retrieval accuracy across the board.

[Table: Passkey retrieval accuracy; the three scores per entry denote retrieval accuracy relative to the position of the passkey in the corpus (start/middle/end). Adapted from Munkhdalai et al., Table 3]

3. State-of-the-Art Book Summarization (500k Context)

Beyond synthetic tests, the authors also evaluated the model on the BookSum benchmark (Kryściński et al., 2021) [5], where the model is required to generate a summary of a long novel. The 8B-parameter Infini-attention model set a new state of the art on the benchmark, generating successful summaries of books up to 500,000 tokens long.

The results also show a clear trend: the model’s summarisation ability improves as longer contexts are fed into it. The graph below supports this, showing that instead of forgetting earlier information (a common failure mode known as “lost in the middle”), the model can effectively use the Memory Matrix to generate accurate summaries.

[Figure: ROUGE vs. input length. ROUGE measures how close an AI-generated summary is to a human-written summary based on lexical similarity. Adapted from Munkhdalai et al., Figure 4]

4. Visualising the Gating Scalar

As an additional ablation study, the authors visualised the learnt gating scalar (β) to see how the model was using its new memory. Shown below is the heatmap of the resulting visualisation. The attention heads split into two distinct roles:

  • Specialised Heads: Heads that have a score near 1 or 0, indicating that they choose to focus either on local context (within segment) or global history (previous segments).
  • Mixer Heads: Heads that have scores near 0.5, indicating that their main role is to merge information from both pathways efficiently.

This suggests that the model can learn to switch between short-term/long-term recall and mix information across the entire sequence.

[Figure: Visualisation of β reveals that attention heads tend to specialise in either global or local attention under the Infini-attention architecture. Adapted from Munkhdalai et al., Figure 3]

5. Conclusion

While Infini-attention may not fully replace external vector databases and RAG systems for reasoning over static knowledge, it does change how models process standard user queries. Integrating such architectures could be the next step toward unlocking research creativity that has so far been bottlenecked by hardware constraints, ultimately accelerating progress in language modelling.

👉 If you liked this piece, I share shorter, up-to-date write-ups on Substack.
👉 And if you want to support independent research writing, BuyMeACoffee helps keep it going.

6. References

  1. Infini-attention (Main Paper): Munkhdalai, T., Faruqui, M., & Gopal, S. (2024). Leave No Context Behind: Efficient Infinite Context Transformers with Infini-attention. arXiv preprint arXiv:2404.07143.
  2. Transformer-XL: Dai, Z., Yang, Z., Yang, Y., Carbonell, J., Le, Q. V., & Salakhutdinov, R. (2019). Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context. arXiv preprint arXiv:1901.02860.
  3. Memorizing Transformers: Wu, Y., Rabe, M. N., Hutchins, D., & Szegedy, C. (2022). Memorizing Transformers. arXiv preprint arXiv:2203.08913.
  4. Linear Attention (The math foundation): Katharopoulos, A., Vyas, A., Pappas, N., & Fleuret, F. (2020). Transformers are RNNs: Fast Autoregressive Transformers with Linear Attention. International Conference on Machine Learning.
  5. BookSum Benchmark: Kryściński, W., Rajani, N., Agarwal, D., Xiong, C., & Radev, D. (2021). BookSum: A Collection of Datasets for Long-form Narrative Summarization. arXiv preprint arXiv:2105.08209.
  6. Standard Attention: Vaswani, A., Shazeer, N., Parmar, N., et al. (2017). Attention Is All You Need. Advances in Neural Information Processing Systems 30.
  7. Delta Rule: Schlag, I., Irie, K., & Schmidhuber, J. (2021). Linear Transformers Are Secretly Fast Weight Programmers. International Conference on Machine Learning (ICML). PMLR.
