6.1 — Overview of Rolling Buffer KV Cache
In Section 4.4, we discussed incremental inference as an optimisation technique, which utilises a standard KV cache. This works by calculating the Query, Key, and Value matrices for the input sequence once, using them to generate the first token of the output sequence. After this, the Key and Value matrices are cached. When subsequent tokens are generated, the most recently produced token is used to compute a query vector (not a matrix) and corresponding key and value vectors. These new key and value vectors are then appended to the cached Key and Value matrices. This approach enables the model to generate new tokens efficiently, as it only needs to compute a query vector and small updates to the cached Key and Value matrices rather than recalculating the full Query, Key, and Value matrices at every timestep.
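To make this concrete, here is a minimal NumPy sketch of incremental inference with a standard KV cache. The projection matrices W_q, W_k, and W_v, the toy dimensions, and the simplified single-head attention are illustrative assumptions only, not Mistral's actual implementation.

```python
import numpy as np

d_model, d_head = 8, 8                      # toy dimensions for illustration
rng = np.random.default_rng(0)
W_q, W_k, W_v = (rng.standard_normal((d_model, d_head)) for _ in range(3))

# Prompt embeddings: (seq_len, d_model). K and V are computed once and cached.
prompt = rng.standard_normal((5, d_model))
K_cache = prompt @ W_k                      # (5, d_head)
V_cache = prompt @ W_v                      # (5, d_head)

def generate_step(x_new, K_cache, V_cache):
    """One incremental-inference step for a single new token embedding."""
    q = x_new @ W_q                         # query *vector*, not a matrix
    k = x_new @ W_k
    v = x_new @ W_v
    K_cache = np.vstack([K_cache, k])       # cache grows by one row per token
    V_cache = np.vstack([V_cache, v])
    scores = K_cache @ q / np.sqrt(d_head)  # attend over all cached positions
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    out = weights @ V_cache                 # attention output for the new token
    return out, K_cache, V_cache

x_new = rng.standard_normal(d_model)        # embedding of the latest token
out, K_cache, V_cache = generate_step(x_new, K_cache, V_cache)
print(K_cache.shape)                        # (6, 8) — one new row appended
```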
Rolling Buffer KV Cache extends this further by taking advantage of the sliding window in Sliding Window Attention. “Rolling Buffer” refers to the Key and Value matrices in the cache only storing information for tokens within the current attention window. As a result, the cache can “forget” tokens outside the local context, significantly reducing memory usage while maintaining the necessary information for accurate token generation. Together, these innovations enable the model to handle long inputs efficiently, making the 32,000-token context length feasible without incurring excessive memory usage.
6.2 — Implementing the Rolling Buffer
Unlike the standard KV cache, whose matrices grow as each token is predicted, the Rolling Buffer remains at a fixed size throughout inference, determined by the size of the attention window. As the window slides forward, the cache updates by replacing the key and value vectors of tokens that have fallen outside the current window with those of the new tokens entering it. This ensures the cache only stores information relevant to the active context, thereby reducing memory usage.
The image below is taken from the Mistral 7B paper and shows the concept of the Rolling Buffer for three example sentences. For the sentence “This is an example of…”, the cache has a window size of 4 tokens. Initially, tokens are appended sequentially: “This”, “is”, “an”, and “example”. When the fifth token, “of”, is added, the first token, “This”, is removed to maintain the window size. The cache continues this rolling process, ensuring that only the most recent 4 tokens are stored at any given time.
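A minimal sketch of this rolling behaviour is shown below, assuming a fixed-size buffer indexed with modulo arithmetic. The RollingKVCache class is hypothetical and stores token strings purely for readability; a real implementation stores key/value tensors per layer and per attention head.

```python
# A minimal sketch of a rolling buffer, assuming modulo indexing over a
# fixed-size window. Strings stand in for cached key/value vectors.
class RollingKVCache:
    def __init__(self, window_size: int):
        self.window_size = window_size
        self.buffer = [None] * window_size
        self.position = 0                       # total tokens seen so far

    def add(self, kv):
        # Position i is always written to slot i % window_size, so the
        # oldest entry is overwritten once the window is full.
        self.buffer[self.position % self.window_size] = kv
        self.position += 1

    def contents(self):
        # Return cached entries in chronological order.
        if self.position < self.window_size:
            return self.buffer[: self.position]
        start = self.position % self.window_size
        return self.buffer[start:] + self.buffer[:start]

cache = RollingKVCache(window_size=4)
for token in ["This", "is", "an", "example", "of"]:
    cache.add(token)

print(cache.contents())   # ['is', 'an', 'example', 'of'] — "This" was evicted
```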
6.3 — Pre-filling and Chunking
The Mistral 7B paper also introduces the concepts of pre-filling and chunking, which offer further methods for reducing time and memory usage during inference.
Pre-filling refers to populating the KV Cache with the key and value vectors for all tokens in the input sequence prior to incremental inference. This process ensures that the static portion of the input sequence (e.g., a prompt) is fully processed ahead of time, reducing redundant computation when generating new tokens.
Chunking addresses the challenge of handling long sequences by dividing the input into fixed-length sections called chunks, each equal in length to the attention window. To prevent memory overload, the Key and Value matrices for each chunk are calculated separately and added to the cache iteratively. Chunking can also be applied during inference as more tokens are generated. Tokens in the newest chunk attend only to themselves and to the cached tokens from the previous chunk (as long as those tokens fall within the context window). This is illustrated in the image below, which is taken from the Mistral 7B paper.
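The sketch below combines pre-filling and chunking, assuming a hypothetical compute_kv function standing in for the model's forward pass and a deque as the rolling buffer; it only shows the control flow of filling the cache one window-sized chunk at a time before generation starts, with the attention computation itself omitted.

```python
from collections import deque
import numpy as np

d_model, d_head, window_size = 16, 8, 4
rng = np.random.default_rng(0)
W_k = rng.standard_normal((d_model, d_head))
W_v = rng.standard_normal((d_model, d_head))

def compute_kv(chunk):
    # Stand-in for a forward pass: one (key, value) pair per token in the chunk.
    return list(zip(chunk @ W_k, chunk @ W_v))

def prefill(prompt_embeddings, window_size):
    """Pre-fill a rolling KV cache one window-sized chunk at a time."""
    # deque(maxlen=...) gives the rolling-buffer behaviour: old entries are
    # dropped automatically once the window is full.
    cache = deque(maxlen=window_size)
    n = prompt_embeddings.shape[0]
    for start in range(0, n, window_size):
        chunk = prompt_embeddings[start : start + window_size]
        # While a chunk is processed, its tokens attend to themselves and to
        # the cached keys/values from the previous chunk (attention omitted).
        cache.extend(compute_kv(chunk))
    return cache

prompt = rng.standard_normal((10, d_model))   # 10 prompt tokens, toy embeddings
cache = prefill(prompt, window_size)
print(len(cache))                             # 4 — only the last window is kept
```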
7.1 — Recap on Activation Functions
Activation functions are essential neural network components found throughout transformer models; they allow the network to learn complex patterns in its input data. When activations from one layer of neurons pass to the next, they are multiplied by weights and summed together to produce weighted sums (denoted z). Since the weighted sums are formed using only multiplication and addition, this step is a linear transformation. To capture more intricate relationships, a non-linear “activation” function is then applied to each z value; many of the classic functions squash their inputs into a fixed range such as 0 to 1 (or -1 to 1, depending on the function).
One of the first widely-used activation functions was the Sigmoid function, which smoothly maps large negative sums to 0 and large positive sums to 1. Its key feature is that small changes in the input around the midpoint (near 0) result in small, smooth changes in the output, which helps stabilise the learning process.
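As a quick illustration of a weighted sum followed by the Sigmoid non-linearity, here is a minimal NumPy sketch; the weights, bias, and inputs are arbitrary toy values.

```python
import numpy as np

def sigmoid(z):
    # Smoothly maps large negative z towards 0 and large positive z towards 1.
    return 1.0 / (1.0 + np.exp(-z))

x = np.array([0.5, -1.2, 0.8])      # activations from the previous layer
w = np.array([0.4, 0.3, -0.6])      # learned weights (toy values)
b = 0.1                             # bias term

z = np.dot(w, x) + b                # linear transformation: weighted sum
a = sigmoid(z)                      # non-linear activation
print(z, a)                         # z = -0.54, a ≈ 0.37 for these toy values
```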
7.2 — Rectified Linear Unit (ReLU)
Despite its initial popularity, the Sigmoid activation function suffers from a few issues, chief among these being the vanishing gradient problem we discussed in Section 2.2. The Rectified Linear Unit (ReLU) was proposed to address these limitations in the 1975 paper, “Cognitron: A Self-Organizing Multilayered Neural Network” by Kunihiko Fukushima [18].
The ReLU activation function simplifies the computation by setting the output to zero for negative input values (z<0) and mapping positive input values linearly (z for z>0). Unlike Sigmoid, ReLU avoids saturation for highly positive inputs, maintaining sensitivity to changes and allowing more efficient learning in deep networks.
Note: Saturation describes an activation function that produces outputs that are nearly constant regardless of input changes, leading to diminished gradients and hindering effective weight updates. ReLU’s linear behaviour for positive values prevents this problem.
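For comparison with the Sigmoid sketch above, here is a minimal ReLU sketch (toy values again):

```python
import numpy as np

def relu(z):
    # Zero for negative inputs, identity for positive inputs.
    return np.maximum(0.0, z)

z = np.array([-2.0, -0.5, 0.0, 0.5, 2.0, 100.0])
print(relu(z))   # [0. 0. 0. 0.5 2. 100.] — no saturation for large positive z
```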
7.3 — Gated Linear Unit (GLU)
Gated Linear Units (GLUs) were introduced in 2017 by Dauphin et al. in the paper “Language Modeling with Gated Convolutional Networks” [19]. While ReLU activation functions remain widely used in modern neural network architectures, GLUs have become increasingly popular in language modelling tasks due to their ability to better capture complex linguistic patterns and relationships.
A key feature of GLUs is the gating mechanism inside each unit, which dynamically adjusts the activation outputs. This mechanism involves an additional learned gate, expressed mathematically as z1 ⋅ σ(z2), where z1 is the main input and z2 acts as the gate. The second input z2, which is passed through a sigmoid activation function σ(z2), controls the flow of information, providing a mechanism for selective activation. This two-input design distinguishes GLUs from ReLU, offering a more nuanced activation function that helps mitigate the risk of neurons becoming permanently inactive (a common problem with ReLU). We won’t dive into the intricacies here, but if you are interested in learning more about GLUs, I encourage you to read the original paper.
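The gating mechanism can be sketched as follows, assuming z1 and z2 come from two separate learned linear projections of the same input; the projection matrices here are random placeholders, and biases are omitted for brevity.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def glu(x, W1, W2):
    # GLU(x) = z1 * sigmoid(z2), where z1 = x @ W1 is the main path and
    # z2 = x @ W2 acts as a learned gate on each output dimension.
    z1 = x @ W1
    z2 = x @ W2
    return z1 * sigmoid(z2)

rng = np.random.default_rng(0)
d_in, d_out = 16, 8
W1 = rng.standard_normal((d_in, d_out))     # main projection (toy weights)
W2 = rng.standard_normal((d_in, d_out))     # gate projection (toy weights)
x = rng.standard_normal(d_in)
print(glu(x, W1, W2).shape)                 # (8,)
```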
7.4 — Swish Gated Linear Unit (SwiGLU)
The Swish Gated Linear Unit (SwiGLU) was proposed as an improvement to the regular Gated Linear Unit (GLU) and was popularised by Google Research’s 2022 paper, “PaLM: Scaling Language Modeling with Pathways,” which used it in the PaLM model [20]. By combining the Swish activation function (expressed as z ⋅ σ(z)) with GLU’s gating mechanism, SwiGLU offers greater expressiveness and a better capacity to model complex relationships in data, making it particularly effective in language modelling tasks. Note the difference between Swish and GLU: Swish is a single-input function, not a two-input function like the GLU.
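A sketch of Swish and of a SwiGLU-style gated projection is shown below; as before, the weight matrices are random stand-ins, and the exact parameterisation (biases, hidden dimensions) varies between implementations.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def swish(z):
    # Swish(z) = z * sigmoid(z): a single-input, smooth alternative to ReLU.
    return z * sigmoid(z)

def swiglu(x, W1, W2):
    # SwiGLU replaces the sigmoid gate of a GLU with the Swish function:
    # output = Swish(x @ W1) * (x @ W2).
    return swish(x @ W1) * (x @ W2)

rng = np.random.default_rng(0)
d_in, d_hidden = 16, 8
W1 = rng.standard_normal((d_in, d_hidden))  # gated projection (toy weights)
W2 = rng.standard_normal((d_in, d_hidden))  # linear projection (toy weights)
x = rng.standard_normal(d_in)
print(swiglu(x, W1, W2).shape)              # (8,)
```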
Mistral 7B utilises the SwiGLU activation function in its feedforward sub-layers, enhancing its ability to extract meaningful patterns from training data and improving performance during inference. This refinement contributes to Mistral 7B’s effectiveness in handling intricate linguistic structures and large context windows.
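In a feedforward sub-layer, this gating is commonly implemented with three weight matrices: a gate projection, an up projection, and a down projection back to the model dimension. The sketch below follows that common Llama/Mistral-style structure, but the dimensions and weight names are illustrative rather than taken from Mistral 7B's actual configuration.

```python
import numpy as np

def swish(z):
    return z / (1.0 + np.exp(-z))            # equivalent to z * sigmoid(z)

def swiglu_feedforward(x, W_gate, W_up, W_down):
    # Common SwiGLU feedforward pattern: gate the up-projection with
    # Swish(x @ W_gate), then project back down to the model dimension.
    return (swish(x @ W_gate) * (x @ W_up)) @ W_down

rng = np.random.default_rng(0)
d_model, d_ff = 32, 128                       # illustrative sizes only
W_gate = rng.standard_normal((d_model, d_ff))
W_up   = rng.standard_normal((d_model, d_ff))
W_down = rng.standard_normal((d_ff, d_model))

x = rng.standard_normal(d_model)              # one token's hidden state
print(swiglu_feedforward(x, W_gate, W_up, W_down).shape)   # (32,)
```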
With the release of Mistral 7B, Mistral AI entered the LLM space at a time when model size was the main factor driving performance. Rather than following the trend of ever-larger models, Mistral AI distinguished themselves by emphasising innovative, memory-efficient designs that deliver impressive results with a fraction of the parameters. The success of Mistral 7B demonstrated that strong performance doesn’t always require enormous models, and that strategic design choices can enable smaller models to match, or even outperform, their larger counterparts.
Building on this approach, Mistral AI continues to push the boundaries of efficiency and performance, expanding into areas such as Mixture of Experts with Mixtral 8x7B, vision-language models with Pixtral, and even the mobile space with Ministral 3B. As the company progresses, it will be interesting to see how they continue to push the state of the art forward for smaller models.
[1] Jiang, A. Q., et al., Mistral 7B (2023), arXiv preprint arXiv:2310.06825.
[2] Hugging Face, Mistral AI (2024), HuggingFace.co.
[3] Hendrycks, D., et al., Measuring Massive Multitask Language Understanding (2020), arXiv preprint arXiv:2009.03300.
[4] Zhong, W., et al., AGIEval: A Human-Centric Benchmark for Evaluating Foundation Models (2023), arXiv preprint arXiv:2304.06364.
[5] Suzgun, M., et al., Challenging BIG-Bench Tasks and Whether Chain-of-Thought Can Solve Them (2022), arXiv preprint arXiv:2210.09261.
[6] Ba, J., et al., Layer Normalization (2016), arXiv preprint arXiv:1607.06450.
[7] Zhang, B., and Sennrich, R., Root Mean Square Layer Normalization (2019), arXiv preprint arXiv:1910.07467.
[8] Shaw, P., et al., Self-Attention with Relative Position Representations (2018), arXiv preprint arXiv:1803.02155.
[9] Dai, Z., et al., Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context (2019), arXiv preprint arXiv:1901.02860.
[10] Raffel, C., et al., Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer (2019), arXiv preprint arXiv:1910.10683.
[11] Su, J., et al., RoFormer: Enhanced Transformer with Rotary Position Embedding (2023), arXiv preprint arXiv:2104.09864.
[12] Hugging Face, Modeling Llama (2024), GitHub.
[13] Vaswani, A., et al., Attention Is All You Need (2017), Advances in Neural Information Processing Systems 30 (NIPS 2017).
[14] Shazeer, N., Fast Transformer Decoding: One Write-Head is All You Need (2019), arXiv preprint arXiv:1911.02150.
[15] Ainslie, J., et al., GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints (2023), arXiv preprint arXiv:2305.13245.
[16] Raffel, C., et al., Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer (2023), arXiv preprint arXiv:1910.10683.
[17] Beltagy, I., et al., Longformer: The Long-Document Transformer (2020), arXiv preprint arXiv:2004.05150.
[18] Fukushima, K., Cognitron: A Self-Organizing Multilayered Neural Network (1975), Biological Cybernetics, https://link.springer.com/article/10.1007/BF00342633.
[19] Dauphin, Y. N., et al., Language Modeling with Gated Convolutional Networks (2017), arXiv preprint arXiv:1612.08083.
[20] Chowdhery, A., et al., PaLM: Scaling Language Modeling with Pathways (2022), arXiv preprint arXiv:2204.02311.