...

[2410.12876] In-context KV-Cache Eviction for LLMs via Attention-Gate


Authors: Zihao Zeng and 4 other authors

Abstract: The KV-Cache technique has become the standard for the inference of large language models (LLMs). Yet, the KV-Cache is widely criticized as a potential bottleneck of the LLM inference system. This paper enables a novel dynamic KV-Cache eviction policy by injecting a lightweight module, called Attention-Gate, into the model. It accepts the global context as input and yields eviction flags for each token. The self-attention modules in the model proceed according to these flags and cache only a subset of the KV states for next-token prediction. The Attention-Gates can yield different flags for different heads and layers, and can easily be tuned on top of a pre-trained LLM via continual pre-training or supervised fine-tuning. The computational and memory overhead introduced by the Attention-Gates is minimal. We empirically evaluate the proposed approach across multiple scenarios, showing that effective eviction of redundant tokens not only improves efficiency but can also enhance performance.
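The abstract gives only a high-level description of the mechanism. As a rough illustration of the idea, the sketch below shows a hypothetical per-head gate that scores each token from the context and produces keep/evict flags, plus a toy attention function that ignores evicted positions so their KV states would not need to be cached. All names (AttentionGate, masked_attention, keep_threshold) and design details are assumptions for illustration, not the authors' implementation.

# Hypothetical sketch (not the paper's code): a lightweight per-head gate that
# maps hidden states to eviction flags, and attention that skips evicted tokens.
import torch
import torch.nn as nn
import torch.nn.functional as F


class AttentionGate(nn.Module):
    """Lightweight gate: hidden states -> per-head keep/evict flags."""

    def __init__(self, hidden_size: int, num_heads: int):
        super().__init__()
        # One scalar score per head per token; a single linear layer keeps
        # the parameter and compute overhead small.
        self.score = nn.Linear(hidden_size, num_heads)

    def forward(self, hidden_states: torch.Tensor, keep_threshold: float = 0.5):
        # hidden_states: (batch, seq_len, hidden_size)
        logits = self.score(hidden_states)           # (batch, seq_len, num_heads)
        keep_prob = torch.sigmoid(logits)
        # Hard flags at inference time; training the gate end to end would
        # require a soft or straight-through relaxation (an assumption here).
        keep_flags = keep_prob > keep_threshold      # bool, (batch, seq_len, num_heads)
        keep_flags[:, 0, :] = True                   # always keep at least one token
        return keep_flags


def masked_attention(q, k, v, keep_flags):
    """Single-head attention that ignores (conceptually evicts) flagged-out tokens.

    q: (batch, q_len, d), k/v: (batch, kv_len, d), keep_flags: (batch, kv_len) bool
    """
    d = q.size(-1)
    scores = q @ k.transpose(-1, -2) / d**0.5        # (batch, q_len, kv_len)
    scores = scores.masked_fill(~keep_flags.unsqueeze(1), float("-inf"))
    attn = F.softmax(scores, dim=-1)
    return attn @ v


if __name__ == "__main__":
    batch, seq, hidden, heads = 1, 8, 32, 4
    head_dim = hidden // heads
    x = torch.randn(batch, seq, hidden)
    gate = AttentionGate(hidden, heads)
    flags = gate(x)                                  # (1, 8, 4): per-token, per-head
    # Toy single-head demo: evicted positions receive no attention weight,
    # so their KV states never need to be stored.
    q = k = v = torch.randn(batch, seq, head_dim)
    out = masked_attention(q, k, v, flags[..., 0])
    print(flags[..., 0])
    print(out.shape)                                 # torch.Size([1, 8, 8])

In a real decoder the gate would presumably run once per layer, and the flags would be used to skip writing the corresponding K/V entries into the cache rather than merely masking them at attention time; the sketch only conveys the flag-then-mask idea.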

Submission history

From: Zihao Zeng
[v1] Tue, 15 Oct 2024 05:01:19 UTC (14,063 KB)
[v2] Sat, 19 Oct 2024 08:45:11 UTC (14,063 KB)
[v3] Thu, 17 Apr 2025 03:51:06 UTC (3,982 KB)
