Towards Threshold-Free KV Cache Pruning
by Xuanfan Ni and 8 other authors
Abstract: To reduce memory consumption during LLM inference, prior works have proposed numerous methods for KV cache pruning based on various criteria. While these techniques often achieve lossless memory reduction on many datasets, they rely on an under-emphasized condition: a dataset- or domain-specific budget size threshold must be pre-determined to reach optimal performance. Such input-specific tuning is of limited use in real-world scenarios, since open-domain inputs span diverse domains, lengths, and difficulty levels, with no clear boundaries for pre-tuning. The dependence on an input-sensitive threshold is thus an inherent limitation that can cause large degradation on arbitrary inputs. In this work, we propose a new objective that lifts the threshold constraint for robust KV pruning, calling for "threshold-free" methods that automatically adjust budget sizes while preserving full-cache performance. We then propose ReFreeKV, a novel method and the first solution fulfilling this objective, validated by extensive experiments on 13 datasets of diverse context lengths, task types, and model sizes.
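For concreteness, the fixed-budget setup that the abstract critiques can be sketched as follows. This is a generic, score-based eviction routine in the style of prior KV pruning work, not the ReFreeKV method itself (which the abstract does not detail); the function name, tensor shapes, and attention-mass scoring are illustrative assumptions.

```python
import torch

def prune_kv_cache(keys, values, attn_scores, budget):
    """Keep only the `budget` cache entries with the highest
    accumulated attention scores; evict the rest.

    keys, values: [num_heads, seq_len, head_dim] cached tensors
    attn_scores:  [num_heads, seq_len] accumulated attention mass
                  per cached position (one common pruning criterion)
    budget:       fixed number of entries to retain -- the per-domain
                  threshold the abstract argues must be pre-tuned
    """
    seq_len = keys.shape[1]
    if seq_len <= budget:
        return keys, values  # cache already within budget

    # Select the top-`budget` positions per head, then restore
    # their original temporal order before gathering.
    top_idx = attn_scores.topk(budget, dim=-1).indices.sort(dim=-1).values
    idx = top_idx.unsqueeze(-1).expand(-1, -1, keys.shape[-1])
    return keys.gather(1, idx), values.gather(1, idx)

# Toy usage: 4 heads, 1024 cached tokens, pruned to a budget of 256.
keys = torch.randn(4, 1024, 64)
values = torch.randn(4, 1024, 64)
scores = torch.rand(4, 1024)
k, v = prune_kv_cache(keys, values, scores, budget=256)
print(k.shape)  # torch.Size([4, 256, 64])
```

The key point of the paper is that `budget` above cannot be fixed in advance for open-domain inputs: a threshold-free method would have to determine the retained set per input while matching full-cache quality.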
Submission history
From: Liyan Xu
[v1] Mon, 24 Feb 2025 06:33:39 UTC (252 KB)
[v2] Mon, 9 Jun 2025 15:31:53 UTC (138 KB)
[v3] Tue, 6 Jan 2026 14:32:34 UTC (160 KB)