...

[2410.12876] In-context KV-Cache Eviction for LLMs via Attention-Gate


Authors: Zihao Zeng and 4 other authors

Abstract: The KV-Cache technique has become the standard for the inference of large language models (LLMs). Yet, the KV-Cache is widely criticized as a potential bottleneck of the LLM inference system. This paper enables a novel dynamic KV-Cache eviction policy by injecting a lightweight module, called Attention-Gate, into the model. It accepts the global context as input and yields eviction flags for each token. The self-attention modules in the model proceed according to these flags and cache only a subset of the KV states for next-token prediction. The Attention-Gates can yield different flags for different heads and layers, and can easily be tuned on top of a pre-trained LLM via continual pre-training or supervised fine-tuning. The computational and memory overhead introduced by the Attention-Gates is minimal. We empirically evaluate the proposed approach across multiple scenarios, showing that effective eviction of redundant tokens not only improves efficiency but can also enhance performance.
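The abstract gives only a high-level description of the mechanism. As a rough illustration of the idea, the sketch below shows a hypothetical per-head gate that scores each token from the context and produces keep/evict flags, plus a toy attention function that ignores evicted positions so their KV states would not need to be cached. All names (AttentionGate, masked_attention, keep_threshold) and design details are assumptions for illustration, not the authors' implementation.

# Hypothetical sketch (not the paper's code): a lightweight per-head gate that
# maps hidden states to eviction flags, and attention that skips evicted tokens.
import torch
import torch.nn as nn
import torch.nn.functional as F


class AttentionGate(nn.Module):
    """Lightweight gate: hidden states -> per-head keep/evict flags."""

    def __init__(self, hidden_size: int, num_heads: int):
        super().__init__()
        # One scalar score per head per token; a single linear layer keeps
        # the parameter and compute overhead small.
        self.score = nn.Linear(hidden_size, num_heads)

    def forward(self, hidden_states: torch.Tensor, keep_threshold: float = 0.5):
        # hidden_states: (batch, seq_len, hidden_size)
        logits = self.score(hidden_states)           # (batch, seq_len, num_heads)
        keep_prob = torch.sigmoid(logits)
        # Hard flags at inference time; training the gate end to end would
        # require a soft or straight-through relaxation (an assumption here).
        keep_flags = keep_prob > keep_threshold      # bool, (batch, seq_len, num_heads)
        keep_flags[:, 0, :] = True                   # always keep at least one token
        return keep_flags


def masked_attention(q, k, v, keep_flags):
    """Single-head attention that ignores (conceptually evicts) flagged-out tokens.

    q: (batch, q_len, d), k/v: (batch, kv_len, d), keep_flags: (batch, kv_len) bool
    """
    d = q.size(-1)
    scores = q @ k.transpose(-1, -2) / d**0.5        # (batch, q_len, kv_len)
    scores = scores.masked_fill(~keep_flags.unsqueeze(1), float("-inf"))
    attn = F.softmax(scores, dim=-1)
    return attn @ v


if __name__ == "__main__":
    batch, seq, hidden, heads = 1, 8, 32, 4
    head_dim = hidden // heads
    x = torch.randn(batch, seq, hidden)
    gate = AttentionGate(hidden, heads)
    flags = gate(x)                                  # (1, 8, 4): per-token, per-head
    # Toy single-head demo: evicted positions receive no attention weight,
    # so their KV states never need to be stored.
    q = k = v = torch.randn(batch, seq, head_dim)
    out = masked_attention(q, k, v, flags[..., 0])
    print(flags[..., 0])
    print(out.shape)                                 # torch.Size([1, 8, 8])

In a real decoder the gate would presumably run once per layer, and the flags would be used to skip writing the corresponding K/V entries into the cache rather than merely masking them at attention time; the sketch only conveys the flag-then-mask idea.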

Submission history

From: Zihao Zeng
[v1] Tue, 15 Oct 2024 05:01:19 UTC (14,063 KB)
[v2] Sat, 19 Oct 2024 08:45:11 UTC (14,063 KB)
[v3] Thu, 17 Apr 2025 03:51:06 UTC (3,982 KB)
