Breadcrumbs Reasoning: Memory-Efficient Reasoning with Compression Beacons
by Giovanni Monea and 4 other authors
Abstract: The scalability of large language models for long-context reasoning is severely constrained by the linear growth of their Transformer key-value cache, which incurs significant memory and computational costs. We posit that as a model generates reasoning tokens, the informational value of past generated tokens diminishes, creating an opportunity for compression. In this work, we propose to periodically compress the generation KV cache into a learned, special-purpose token and to evict the compressed entries. We train the model to perform this compression via a modified joint distillation and reinforcement learning (RL) framework. Our training method adds minimal overhead to the conventional RL process, as it reuses RL outputs for distillation. Empirically, our method achieves a superior memory-accuracy Pareto frontier compared to both the model without cache compression and training-free compression techniques.
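The periodic compress-and-evict loop described in the abstract can be sketched in plain Python. This is an illustrative sketch only: the paper's beacon is a learned model component, which is emulated here by a stub that merely records which tokens it absorbed, and all names (`CHUNK`, `compress_chunk`, `generate_with_beacons`) are hypothetical, not taken from the paper.

```python
# Sketch of periodic KV-cache compression with beacon tokens.
# The learned compression is replaced by a stub; entry contents are
# placeholders for the real key/value tensors.

CHUNK = 4  # hypothetical compression interval (tokens per beacon)

def compress_chunk(entries):
    """Stand-in for the learned beacon: summarize a chunk into one entry."""
    return ("beacon", [tok for _, tok in entries])

def generate_with_beacons(tokens, chunk=CHUNK):
    cache = []    # compressed part of the KV cache: beacon entries
    pending = []  # uncompressed entries from the current chunk
    for tok in tokens:
        # in a real model, attention here sees cache + pending
        pending.append(("kv", tok))
        if len(pending) == chunk:
            # compress the full chunk into one beacon and evict its entries,
            # so cache size grows by 1 per chunk instead of 1 per token
            cache.append(compress_chunk(pending))
            pending = []
    return cache + pending

cache = generate_with_beacons(list(range(10)), chunk=4)
# 10 tokens -> 2 beacons (tokens 0-3 and 4-7) + 2 uncompressed entries
```

The point of the sketch is the memory profile: with interval `c`, the cache holds at most `n/c` beacons plus `c - 1` recent uncompressed entries, rather than `n` entries, which is the linear-growth problem the abstract targets.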
Submission history
From: Giovanni Monea
[v1] Wed, 15 Oct 2025 17:57:21 UTC (488 KB)
[v2] Mon, 10 Nov 2025 00:06:46 UTC (488 KB)
[v3] Mon, 29 Dec 2025 13:06:57 UTC (557 KB)