
Not All Heads Matter: A Head-Level KV Cache Compression Method with Integrated Retrieval and Reasoning


View a PDF of the paper titled Not All Heads Matter: A Head-Level KV Cache Compression Method with Integrated Retrieval and Reasoning, by Yu Fu and 5 other authors


Summary: Key-Value (KV) caching is a common technique to enhance the computational efficiency of Large Language Models (LLMs), but its memory overhead grows rapidly with input length. Prior work has shown that not all tokens are equally important for text generation, proposing layer-level KV cache compression to selectively retain key information. Recognizing the distinct roles of attention heads in generation, we propose HeadKV, a head-level KV cache compression method, and HeadKV-R2, which leverages a novel contextual reasoning ability estimation for compression. Our approach operates at the level of individual heads, estimating their importance for contextual QA tasks that require both retrieval and reasoning capabilities. Extensive experiments across diverse benchmarks (LongBench, LooGLE), model architectures (e.g., Llama-3-8B-Instruct, Mistral-7B-Instruct), and long-context ability tests demonstrate that our head-level KV cache compression significantly outperforms strong baselines, particularly in low-resource settings (KV size = 64 & 128). Notably, our method retains just 1.5% of the KV cache while achieving 97% of the performance of the full KV cache on the contextual question answering task. Code is available at this https URL.
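The abstract's core idea, allocating the KV cache budget per attention head according to an estimated importance score and then retaining only the highest-scoring tokens in each head's cache, can be sketched as follows. This is a minimal illustration, not the paper's actual algorithm: the function names (`allocate_head_budgets`, `compress_head_kv`), the proportional allocation rule, and the toy importance scores are all assumptions for the example.

```python
import numpy as np

def allocate_head_budgets(importance, total_budget, min_per_head=1):
    """Split a global KV cache budget across heads in proportion to
    per-head importance scores (illustrative allocation rule)."""
    importance = np.asarray(importance, dtype=float)
    weights = importance / importance.sum()
    # Floor to integers, but guarantee every head keeps at least one entry.
    return np.maximum(min_per_head, np.floor(weights * total_budget).astype(int))

def compress_head_kv(keys, values, token_scores, budget):
    """For one head, keep only the `budget` tokens with the highest
    scores (e.g., accumulated attention mass); preserve sequence order."""
    keep = np.sort(np.argsort(token_scores)[-budget:])
    return keys[keep], values[keep]

# Toy example: 4 heads, 16 cached tokens each, global budget of 32 entries.
rng = np.random.default_rng(0)
importance = [8.0, 1.0, 4.0, 3.0]          # assumed head-importance scores
budgets = allocate_head_budgets(importance, total_budget=32)
for budget in budgets:
    K = rng.standard_normal((16, 8))       # 16 tokens, head dim 8
    V = rng.standard_normal((16, 8))
    scores = rng.random(16)                # stand-in for attention mass
    K_small, V_small = compress_head_kv(K, V, scores, budget)
    assert K_small.shape[0] == budget
```

Under this proportional rule, a head judged 8x as important as another keeps roughly 8x as many KV entries, which matches the abstract's observation that strong heads matter far more than weak ones under tight budgets (e.g., KV size 64 or 128).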

Submission history

From: Yu Fu [view email]
[v1]
Fri, 25 Oct 2024 02:22:00 UTC (2,319 KB)
[v2]
Mon, 28 Oct 2024 19:32:23 UTC (2,319 KB)
[v3]
Thu, 14 Nov 2024 01:56:11 UTC (2,319 KB)

