My name is Kirill Khrylchenko, and I lead the RecSys R&D team at Yandex. One of our goals is to develop transformer technologies within the context of recommender systems, an objective we’ve been pursuing for five years now. Not too long ago, we reached a new milestone in the development of recommendation technologies, which I would like to share with you in this article.
The relevance of recommender systems is easy to justify: the amount of content is growing incredibly fast, far too much for anyone to view in its entirety, and we need recommender systems to cope with the information overload. Music, movies, books, products, videos, posts, friends: the list goes on. It’s also important to remember that these services benefit not only users but also content creators, who need to find their target audience.
We’ve deployed a new generation of transformer-based recommenders in several services and are actively rolling them out to more. Across the board, they have significantly improved recommendation quality.
If you’re an ML engineer working with recommendations, this article will provide you with some ideas on how to implement a similar approach for your recommender system. And if you are a user, you have an opportunity to learn more about how that very recommender system works.
How Recommender Systems Work
The recommendation problem itself has a simple mathematical definition: for each user, we want to select the items (objects, documents, or products) that they are likely to like.
But there’s a catch:
- Item catalogs are vast (up to billions of items).
- There is a significant number of users, and their interests are constantly shifting.
- Interactions between users and items are very sparse.
- It is unclear how to define actual user preferences.
To tackle the recommendation problem effectively, we need to leverage non-trivial models that use machine learning.
Neural networks are a potent machine learning tool, especially when there’s a large amount of unstructured data, such as text or images. While classical machine learning involves expert domain knowledge and considerable manual work (feature engineering), neural networks can extract complex relationships and patterns from raw data almost automatically.
In the RecSys domain, we have a large amount of mostly unstructured data (literally trillions of anonymized user-item interactions), as well as entities that are content-based (items consist of titles, descriptions, images, videos, and audio; users can be represented as sequences of events). Additionally, it is crucial that the recommender system performs well for new items and cold users, and encoding users and items through content helps achieve this.
The time we have to generate recommendations for the user is very strictly limited. Every millisecond counts! Additionally, we don’t have infinite resources (in terms of hardware), and the catalogs we need recommendations from are quite large. This is why recommendations are usually formed in multiple stages:
- First, we select a relatively small set of candidates from the entire catalog using lightweight models (retrieval stage).
- Then, we run these candidates through more complex models that utilize additional information and more intensive computations for each candidate (ranking stage).
Architecturally, models vary significantly between stages, making it challenging to discuss any aspect without referring to specific stages of the recommender system.
The two-tower neural network architecture is very popular for the retrieval stage. Users and items (in information retrieval terms, queries and documents) are independently encoded into vector representations, and the dot product is used to calculate the similarity between them.
You could also say that such models “embed” users and items into a shared “semantic space”, where “semantic” means that the closer a user and an item are in this vector space, the better they match.
Two-tower models are very fast. Let’s assume the user requests recommendations. The two-tower model then needs to calculate (see the sketch after this list):
- The “user tower” once per request.
- Vectors of all candidate items for which you want to calculate user-item affinity.
- Dot products.
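To make this factorization of computations concrete, here is a minimal two-tower sketch in PyTorch. The tower architectures, names, and dimensions are illustrative assumptions, not our production model.

```python
import torch
import torch.nn as nn

class TwoTowerModel(nn.Module):
    """Minimal two-tower sketch: the user and the items are encoded
    independently, and affinity is their dot product."""
    def __init__(self, user_dim: int, item_dim: int, dim: int = 128):
        super().__init__()
        self.user_tower = nn.Sequential(nn.Linear(user_dim, dim), nn.ReLU(), nn.Linear(dim, dim))
        self.item_tower = nn.Sequential(nn.Linear(item_dim, dim), nn.ReLU(), nn.Linear(dim, dim))

    def score(self, user_features: torch.Tensor, item_vectors: torch.Tensor) -> torch.Tensor:
        # The user tower runs once per request; item vectors are precomputed offline.
        user_vec = self.user_tower(user_features)   # [D]
        return item_vectors @ user_vec               # [N] dot products with N candidates

# Usage sketch: candidate item vectors are recalculated offline (e.g. daily) and cached.
# item_vectors = model.item_tower(item_features)    # [N, D]
# scores = model.score(user_features, item_vectors) # [N]
```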
You don’t even need to recalculate the vectors of candidate items for each user query, because they are the same for all users and rarely change; for instance, we don’t assume that a movie or a music track often changes its title. In practice, we regularly recalculate item vectors for the entire catalog offline (for example, daily) and upload them to either the service where we need to calculate the dot product or to another service that we access online to retrieve the necessary item vectors.
But that’s me describing a use case where you have some reasonable, small number of candidates you want to calculate user-item affinities for. This is true for the ranking stage. However, at the candidate generation stage, the problem becomes more complicated: we need to calculate proximities for all items in the catalog, select the top-N (where N is typically expressed in hundreds to thousands) with the highest affinity values, and then forward them to the subsequent stages.
This is where two-tower models are invaluable: we can quickly generate an approximate top-N by scalar product, even for massive catalogs, using approximate search methods. We build a special “index” (typically a graph structure, as in the HNSW method) over the already calculated item vectors; we store this index in the service and feed user vectors into it, extracting an approximate top for each of them.
Building this index is difficult and time-consuming (with a separate challenge of quickly updating and rebuilding an index). With that being said, it can still be done offline, and then the binary and the index can be uploaded to the service, where we’ll search for candidates in the runtime environment.
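As an illustration of this offline-build and runtime-search split, here is a minimal sketch using the hnswlib library. The index parameters, file name, and random vectors are placeholders; only the general flow (build and save offline, load and query at runtime) follows the description above.

```python
import hnswlib
import numpy as np

dim, num_items = 128, 1_000_000

# Offline: build an HNSW index over the precomputed item vectors.
item_vectors = np.random.rand(num_items, dim).astype(np.float32)  # stand-in for real item vectors
index = hnswlib.Index(space="ip", dim=dim)                          # "ip" = inner (dot) product
index.init_index(max_elements=num_items, ef_construction=200, M=16)
index.add_items(item_vectors, np.arange(num_items))
index.save_index("items.hnsw")                                      # shipped to the serving binary

# Runtime: approximate top-N candidates for a user vector.
index.set_ef(100)                                                   # recall/latency trade-off
user_vector = np.random.rand(1, dim).astype(np.float32)
candidate_ids, distances = index.knn_query(user_vector, k=500)
```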
How Do We Encode a User Into a Vector?
Classical algorithms solved this problem quite easily: in matrix factorization methods (like ALS), the user vector was “trainable”, represented by the model parameters, and determined within the optimization procedure. In user-item collaborative filtering methods, a user was assigned a vector of catalog dimensionality in which the i-th coordinate corresponded to a particular item and represented how often the user interacted with that item (e.g., how frequently they bought it or how they rated it).
The modern approach is to encode users with transformers. We take the user’s anonymized history, that is, a sequence of events, encode these events into vectors, and then run a transformer over them. In the most basic case, events are purchases or likes; in other cases, it could be the entire history of interactions within a company’s ecosystem.
Initially, when transformers were first used in recommendations, researchers drew analogies from similarities with NLP: a user is like a sentence, and the words in it represent purchases, likes, and other interactions.
Another type of neural network recommender model is models with early fusion. These models don’t separate user and item information into two towers but rather process all information together. That is, we fuse all information about the user, the item, and their interaction at an early stage. In contrast, two-tower models are said to feature late fusion through the scalar product. Early-fusion models are more expressive than two-tower models. They can capture more complex signals and learn more non-trivial dependencies.
However, it’s difficult to apply them outside the ranking stage because of their computational burden and the need to recalculate the entire model for each user query and each candidate. Unlike two-tower models, they don’t support the factorization of computations.
We utilize various architecture types, including two-tower models with transformers and models with early fusion. We use two-tower architectures more often because they are highly efficient, suitable for all stages simultaneously, and still yield good quality gains with considerably fewer resources.
We used to train two-tower models in two stages:
- Pre-training with contrastive learning. We train the model to align users with the items they interacted with positively.
- Task-specific fine-tuning. As in NLP, fine-tuning depends on the downstream task. If the model will be used for ranking, we train it to correctly rank the recommendations shown to the user: if two items were shown and the user liked one and disliked the other, we want the model to rank them in the same order. For retrieval, the task resembles pre-training but employs additional techniques that enhance candidate recall.
In the next section, we’ll explore how this process has changed with our newer models.
Scaling Recommender Systems
Is there a limit to the size of recommender models, after which we no longer see size-related improvements in the quality of recommendations?
For a long time, our recommender models (and not just ours, but models across industry and academia) were very small, which suggested that the answer to this question was “yes”.
However, in deep learning, there is the scaling hypothesis, which states that as models become larger and the volume of data increases, the model quality should improve significantly.
Much of the progress in deep learning over the past decade can be attributed to this hypothesis. Even the earliest successes in deep learning were based on scaling, with the emergence of an extensive dataset for image classification, ImageNet, and the good performance of neural networks (AlexNet) on that dataset.
The scaling hypothesis is even more evident in language models and natural language processing (NLP): you can predict the dependence of quality improvement on the amount of computations and express the corresponding scaling laws.
What do I mean when I say recommender models can be made bigger?
There are as many as four different axes to scale.
Embeddings. We have a variety of information about users and items, so we have access to a wide range of features, and a large portion of these features are categorical. Examples of categorical features are item ID, artist ID, genre, and language.
Categorical features have a very high cardinality (number of unique values)—reaching billions—so if you make large trainable embeddings (vector representations) for them, you get huge embedding matrices.
That said, embeddings are the bottleneck between the input data and the model, so you need to make them large for good quality. For example, Meta has embedding matrices with 675 billion to 13 trillion parameters, while Google reported at least 1 billion parameters in YoutubeDNN back in 2016. Even Pinterest, which had long promoted inductive graph embeddings from PinSage [1, 2], has recently started using large embedding matrices.
Context length. For decades, recommender system engineers have been busy generating features. In modern ranking systems, the number of features can reach hundreds or even thousands, and Yandex services are no exception.
Another example of “context” in a model is the user’s history in a transformer. Here, the size of the context is determined by the length of the history. In both industry and academia, the number tends to be very small, with only a few hundred events at best.
Training dataset size. I already mentioned that we have a lot of data. Recommender systems produce hundreds of datasets that are similar in size to the GPT-3 training dataset.
The industry has multiple use cases of massive datasets with billions of training examples on display: 2 billion, 2.1 billion, 3 billion, 60 billion, 100 billion, 146 billion, 500 billion.
Encoder size. The standard size for early-fusion models is in the millions or tens of millions of parameters. According to the Google papers, “simplified” versions of their Wide&Deep models had 1 to 68 million parameters for the experiments [1, 2]. And if we use a two-layer DCN-v2 (a popular neural network layer for early-fusion models) over a thousand continuous features, we get no more than 10 million parameters.
Two-tower models most often use tiny transformers to encode the user: for example, two transformer blocks with hidden layer dimensionality not exceeding a couple of hundred. This configuration will have at most a few million parameters.
And while the sizes of the embedding matrices and training datasets are already quite large, scaling the length of user history and the capacity of the encoder part of the model remains an open question. Is there any significant scaling by these parameters or not?
This was the question on our minds in February 2024. Then an article from researchers at Meta, titled Actions Speak Louder than Words, cheered us all up a bit.
The authors presented a new encoder architecture called HSTU and formulated both the ranking problem and the candidate generation problem as a generative model. The model had a very long history length (8,000 events!) along with an extensive training dataset (100 billion examples), and the user history encoder was much larger than the previous few million parameters. However, even here, the largest encoder configuration mentioned has only 176 million parameters, and it’s unclear whether they deployed it (judging by the subsequent articles, they didn’t).
Are 176 million parameters in an encoder a lot or a little? If we look at language models, the answer is clear: an LLM with 176 million parameters in the encoder will be highly inferior in capability and problem-solving quality to modern SOTA models with billions or even trillions of parameters.
Why, then, do we have such small models in recommender systems?
Why can’t we achieve a similar leap in quality if we replace natural language texts with anonymized user histories in which actions act as words? Have recommender models already reached the ceiling of their baseline quality, leaving us with only small incremental improvements, tweaking features and target values?
These were the existential questions we asked ourselves when designing our own new ARGUS approach.
RecSys × LLM × RL
After plowing through the extensive literature on scaling, we found that three main conditions determine the success of neural network scaling:
- Lots of data.
- A highly expressive architecture with large model capacity.
- The most general, fundamental learning task possible.
For example, LLMs are very expressive and powerful transformers that learn from literally all the data on the internet. Additionally, the task of predicting the next word is a fundamental task that, in reality, decomposes into various tasks related to different fields, including grammar, erudition, mathematics, physics, and programming. All three conditions are met!
If we look at recommender systems:
- We also have a lot of data: trillions of interactions between users and items.
- We can just as easily use transformers.
- We just need to find the right learning task to scale the recommender model.
That’s what we did.
There’s an interesting aspect of pre-training large language models. If you just ask a pre-trained LLM about something, it will give an average answer: the most likely answer it has encountered in the training data. That answer won’t necessarily be good or right.
But if you add a prompt before the question, like “Imagine you are an expert in X”, it will start providing much more relevant and correct answers.
That’s because LLMs don’t just learn to imitate answers from the internet; they also acquire a more fundamental understanding of the world in an attempt to condense all the information from the training set. They learn patterns and abstractions. And it’s precisely because the LLM knows a wide range of answers and yet possesses a fundamental understanding of the world that we can obtain good answers from it.
We tried to apply this logic to recommender systems. First, you need to express the recommendations as a reinforcement learning task:
- A recommender system is an agent.
- Actions are recommendations. In the most basic case, the recommender system recommends one item at a time (for example, recommends one music track in the music streaming app each time).
- The environment is the users: their behaviors, patterns, preferences, and interests.
- The policy is a probability distribution over items.
- The reward is a user’s positive feedback in response to a recommendation.
There’s a direct analogy to the LLM example. “Answers from the internet” are the actions of past recommender systems (logging policies), and fundamental knowledge about the world is understanding users, their patterns, and preferences. We want our new model to be able to:
- Imitate the actions of past recommender systems.
- Have a good understanding of the users.
- Adjust its actions to achieve a better outcome.
Before we move on to our new approach, let’s examine the most popular setup for training recommendation transformers: next-item prediction. The SASRec model is very representative here. The system accumulates a user’s history of positive interactions with the service (for example, purchases), and the model learns to predict which purchase is likely to come next in the sequence. That is, instead of next-token prediction, as in NLP, we opt for next-item prediction.
This approach (SASRec and common next item prediction) is not consistent with the philosophy I described earlier, which focused on adjusting the logging policy based on fundamental knowledge of the world. It would seem that to predict what the user will buy next, the model should operate under this philosophy:
- It should understand what could be shown to the user by the recommender system that was in production at the time for which the prediction should be made. That is, it should have a good model of logging policy behavior (i.e., a model that can be used to imitate).
- It needs to understand what the user might have liked from the things shown by the past recommender system, meaning that it needs to understand their preferences, which are the very fundamental beliefs about the world.
But models like SASRec don’t explicitly model any of these things. They lack complete information about past logging policies (we only see recommendations with positive outcomes), and we also don’t learn how to replicate these logging policies. There’s no way to know what the past recommender system could have offered. At the same time, we don’t fully understand the model of the world or the user: we ignore all negative feedback and only consider positive feedback.
ARGUS: AutoRegressive Generative User Sequential Modeling
AutoRegressive Generative User Sequential modeling (ARGUS) is our new approach to training recommendation transformers.
First, we examine the entire anonymized user history, including positive interactions but also all other interactions. We capture the essence of the interaction context, the time it occurred, the device used, the product page the user was on, their My Vibe personalization settings, and other relevant details.
User history is a specific sequence of triples (context, item, feedback), where context refers to the interaction context, item represents the object the user interacts with, and feedback denotes the user’s reaction to the interaction (such as whether the user liked the item, bought it, etc.).
Next, we identify two new learning tasks, both of which extend beyond the conventional next-item prediction widely used in industry and academia.
Next item prediction
Our first task is also called next item prediction. Looking at the history and the current interaction context, we predict which item will be interacted with: P(item | history, context).
- If the history contains only recommendation traffic (events generated directly by the recommender system), then the model learns to imitate the logging policy (recommendations from the past recommender system).
- If there is also organic traffic (any traffic not generated by the recommender system, such as traffic from search, or the user going to their library and listening to a favorite track), we also gain more fundamental knowledge about the user, unrelated to the logging policy.
Important: though this task has the same name as in SASRec (next item prediction), it’s not the same task at all. We predict not only positive but also negative interactions, and also take into account the current context. The context helps us understand whether the action is organic or not, and if it’s a recommendation, what surface it’s on (place, page, or carousel). Also, it generally reduces the noise level during model training.
Context is essential for music recommendations: the user’s mood and their current situation have a significant impact on the type of music they want to listen to.
The task of predicting an element from a set is typically expressed as a classification problem, where the elements of the original set serve as classes. Then, we need to use a cross-entropy loss function for training, where the softmax function is applied to the logits (unnormalized outputs of the neural network). Softmax calculation requires computing the sum of exponents from logits across all classes.
In LLMs, vocabulary sizes reach hundreds of thousands of tokens at most, so computing the softmax is not a significant problem. In recommender systems, however, catalogs consist of millions or even billions of items, and calculating the full softmax is infeasible. This is a topic for a separate big article; in short, we have to use a tricky loss function called “sampled softmax” with a logQ correction:
- N is a mix of in-batch and uniform negatives.
- logQ(n) is the logQ correction.
- The temperature T is a trained parameter (stored as an exponent) clipped to [0.01, 100].
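To make the loss more concrete, below is a minimal PyTorch sketch of a sampled softmax with a logQ correction. The function name, tensor shapes, and the way the temperature is parameterized are assumptions for illustration, not the exact production implementation.

```python
import torch
import torch.nn.functional as F

def sampled_softmax_logq_loss(user_vecs, pos_item_vecs, neg_item_vecs,
                              pos_logq, neg_logq, log_temp):
    """Sampled softmax with logQ correction (illustrative sketch).

    user_vecs:         [B, D] user representations
    pos_item_vecs:     [B, D] embeddings of the true next items
    neg_item_vecs:     [N, D] embeddings of sampled negatives (in-batch + uniform)
    pos_logq/neg_logq: log-probabilities of each item being drawn as a negative
    log_temp:          scalar; temperature is assumed to be exp(log_temp), clipped
    """
    temp = torch.exp(log_temp).clamp(0.01, 100.0)
    pos_logits = (user_vecs * pos_item_vecs).sum(-1) / temp - pos_logq    # [B]
    neg_logits = user_vecs @ neg_item_vecs.T / temp - neg_logq            # [B, N]
    logits = torch.cat([pos_logits.unsqueeze(1), neg_logits], dim=1)      # [B, 1 + N]
    labels = torch.zeros(logits.size(0), dtype=torch.long, device=logits.device)
    return F.cross_entropy(logits, labels)  # the positive is always class 0
```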
Feedback prediction
Feedback prediction is the second learning task. Considering history, the current context, and the item, we predict user feedback: P(feedback | history, context, item).
The first task, next item prediction, teaches the model to imitate logging policies (and to understand users, if there is organic traffic). The feedback prediction task, on the other hand, is focused exclusively on gaining fundamental knowledge about users, their preferences, and interests.
It is very similar to how the ranking variant of the model from “Actions Speak Louder than Words” learns on a sequence of (item, action) pairs. However, here the context is treated as a separate token, and the contexts cover more than just recommendation events.
Feedback can have multiple components: whether a track was liked, disliked, or added to a playlist, and what portion of the track was listened to. We predict all types of feedback by decomposing them into individual loss functions. Each individual loss can be any suitable one, such as cross-entropy or regression; for example, binary cross-entropy is sufficient to predict whether a like occurred.
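As a sketch of this decomposition (the head names, the two targets, and the unweighted sum are illustrative assumptions):

```python
import torch
import torch.nn.functional as F

def feedback_loss(item_hidden, heads, targets):
    """Per-signal feedback losses combined into one objective (illustrative sketch).

    item_hidden: [B, D] transformer hidden states at item positions
    heads:       dict of small prediction heads (e.g. nn.Linear), one per signal
    targets:     dict with 'like' in {0, 1} and 'listen_fraction' in [0, 1]
    """
    like_logit = heads["like"](item_hidden).squeeze(-1)
    listen_pred = heads["listen_fraction"](item_hidden).squeeze(-1)
    like_loss = F.binary_cross_entropy_with_logits(like_logit, targets["like"].float())
    listen_loss = F.mse_loss(listen_pred, targets["listen_fraction"])
    return like_loss + listen_loss  # the terms can be weighted in practice
```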
Although some feedback is much rarer than other kinds (there are usually far fewer likes than long listens), the model does a good job of learning to predict all signals. The larger the model, the easier it is to learn all tasks at once, without conflicts. Moreover, frequent feedback (listens) actually helps the model learn to predict rare, sparse feedback (likes).
If we combine all this into a single learning task, we get the following (a sketch of the resulting model follows the list):
- Create histories for the user from triples (context, item, feedback).
- Use the transformer.
- Predict the next item based on the hidden state of the context.
- Predict the user’s feedback after interacting with the item based on the item’s hidden state.
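Here is a rough sketch of how this combined objective can be wired up. Layer counts, head shapes, and how the embeddings are produced are assumptions for illustration; only the layout of (context, item, feedback) tokens and the two heads follows the description above.

```python
import torch
import torch.nn as nn

class ArgusSketch(nn.Module):
    """Each interaction is three tokens: (context, item, feedback). A causal
    transformer runs over the interleaved sequence; the hidden state at a
    context position feeds next item prediction, and the hidden state at an
    item position feeds feedback prediction."""
    def __init__(self, dim: int, n_layers: int = 2, n_heads: int = 4):
        super().__init__()
        layer = nn.TransformerEncoderLayer(dim, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.feedback_head = nn.Linear(dim, 2)  # e.g. like logit + listen fraction

    def forward(self, ctx_emb, item_emb, fb_emb):
        # ctx_emb, item_emb, fb_emb: [B, L, D] embeddings of contexts, items, feedback
        B, L, D = ctx_emb.shape
        seq = torch.stack([ctx_emb, item_emb, fb_emb], dim=2).reshape(B, 3 * L, D)
        mask = nn.Transformer.generate_square_subsequent_mask(3 * L).to(seq.device)
        hidden = self.encoder(seq, mask=mask)   # [B, 3L, D], causal attention
        ctx_hidden = hidden[:, 0::3]    # used with sampled softmax to predict the next item
        item_hidden = hidden[:, 1::3]   # used to predict feedback on that item
        return ctx_hidden, self.feedback_head(item_hidden)
```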
Let me also comment on how this differs from HSTU. In Actions Speak Louder than Words, the authors train two separate models for candidate generation and ranking. The candidate generation model sees the entire history but, like SASRec, models only positive interactions and doesn’t compute a loss where the interaction was negative. The ranking model, as mentioned earlier, learns a task similar to our feedback prediction.
Our solution offers a more comprehensive next item prediction task and a more comprehensive feedback prediction task, and a single model learns both tasks simultaneously.
Simplified ARGUS
Our approach has one big problem—we’re inflating the user’s history. Because each interaction with an item is represented by three tokens at once (context, item, feedback), we would have to feed almost 25,000 tokens into the transformer to analyze 8192 recent user listens.
One could argue that this is still not significant and that the context length is much longer in LLMs; however, this is not entirely accurate. LLMs, on average, have much smaller numbers, typically hundreds of tokens, especially during pre-training.
In contrast, in our music streaming platform, for example, users often have thousands or even tens of thousands of events. We already have much longer context lengths, and inflating those lengths by a factor of three has an even greater impact on learning speed. To tackle this, we created a simplified version of the model, in which each triple (context, item, feedback) is condensed into a single vector. In terms of input format, it resembles our previous generations of transformer models; however, we maintain the same two learning tasks—next item prediction and feedback prediction.
To predict the next item, we take the hidden state from the transformer corresponding to the triple (c, i, f) at a past point in time, concatenate the current context vector to it, compress it to a lower dimension using an MLP, and then use the sampled softmax to learn to predict the next item.
To predict the feedback, we concatenate the vector of the current item and then use an MLP to predict all the required target variables. In terms of recommender transformer architectures, our model becomes less target-aware and less context-aware; however, it still performs well, enabling a three-fold acceleration.
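A minimal sketch of the two heads of the simplified model might look as follows; the MLP shapes, names, and the number of feedback targets are assumptions for illustration.

```python
import torch
import torch.nn as nn

class SimplifiedArgusHeads(nn.Module):
    """Simplified ARGUS: one input vector per (context, item, feedback) triple.
    The heads mix the previous hidden state with the current context or item."""
    def __init__(self, dim: int):
        super().__init__()
        # next item prediction: previous hidden state + current context -> item space
        self.nip_mlp = nn.Sequential(nn.Linear(2 * dim, dim), nn.ReLU(), nn.Linear(dim, dim))
        # feedback prediction: previous hidden state + current item -> feedback targets
        self.fp_mlp = nn.Sequential(nn.Linear(2 * dim, dim), nn.ReLU(), nn.Linear(dim, 2))

    def next_item_query(self, prev_hidden, cur_context):
        # serves as the "user vector" in the sampled softmax over catalog items
        return self.nip_mlp(torch.cat([prev_hidden, cur_context], dim=-1))

    def feedback_logits(self, prev_hidden, cur_item):
        # e.g. a like logit and a predicted listen fraction
        return self.fp_mlp(torch.cat([prev_hidden, cur_item], dim=-1))
```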
ARGUS Implementation
A model trained in this two-headed mode for both tasks simultaneously (next item prediction and feedback prediction) can be implemented as is. The NIP head is responsible for candidate selection, and the FP head for final ranking.
But we didn’t want to do that, at least not for our first implementation:
- Our goal was to implement a very large model, so we initially focused on offline deployment. With offline deployment, user and item vectors are recalculated daily within a separate regular process, and you only need to calculate the dot product in the runtime environment.
- The pre-trained version of ARGUS implies access to the user’s history without any delay: we see all events in their history up to the current point in time when the prediction is made. That is, it needs to be applied at runtime.
- The NIP head predicts all user interactions, and the model is usually trained to predict only future positive interactions to generate candidates. But predicting positive interactions is a heuristic, a surrogate learning task. It might even be better to use a head that predicts all interactions because it learns to be consistent with the ranking. If an item has been recommended, it means the ranking liked it. But in this situation, we weren’t ready to experiment with that and instead wanted to follow the well-trodden path.
- The FP head learns for pointwise losses: whether a track will be liked or not, what portion of the track will be heard, and so on. But we still often train models for pairwise ranking: we learn to rank items that were recommended “next to each other” and received different feedback. Some argue that pointwise losses are sufficient for training ranking models, but in this case, we don’t replace the entire ranking stack. Instead, we aim to add a new, powerful, neural-network-based feature to the final ranking model. If the final ranking model is trained for a particular task (such as pairwise ranking), then the neural network that generates the feature is most efficiently trained for that task; otherwise, the final model will rely less on our feature. Accordingly, we’d like to pre-train ARGUS for the same task as the original ranking model, allowing us to utilize it in ranking.
There are other deployment use cases beyond the conventional candidate generation and ranking stages, and we’re actively researching these as well. However, for our first deployment, we went with an offline two-tower ranking:
- We decided to fine-tune ARGUS so that it could be used as an offline two-tower model. We use it to recalculate user and item vectors daily, while user preferences are determined through the dot product of the user with the items.
- We fine-tuned ARGUS for a pairwise ranking task similar to the one the final ranking model is trained on. This means that we selected pairs of tracks that the user heard and rated differently in terms of positive feedback, and we want to learn how to rank them correctly.
We build these models quite often: they are easy to train and implement in terms of resources and development costs. However, our previous models were significantly smaller and were trained differently: not with the ARGUS procedure, but first with the usual contrastive learning between users and positive items, and then fine-tuned for the task.
Our previous contrastive pre-training procedure involved compiling multiple training examples per user: if the user had n purchases, there would be n samples in the dataset. In other words, we didn’t use autoregressive learning, and we ran the transformer n times during training. This approach let us be very flexible in creating (user, item) pairs for training, use any history format, encode context together with the user, and account for lags (for example, when predicting likes, we can use a one-day lag in the user’s history). However, things were running pretty slowly.
ARGUS pre-training employs autoregressive learning, where we learn from all events in the user’s activity simultaneously in a single transformer run. This is a powerful acceleration that allowed us to train much larger models using the same resources.
During fine-tuning, we also used to run the transformer many times for a single user. This is the impression-level learning that Meta also used before HSTU. If a user is shown an item at a specific moment, we generate a sample of the form (user, item). The dataset can contain a large number of such impressions for a single user, and we rerun the transformer for each one of them. For pairwise ranking, we considered triples of the form (user, item1, item2), the same ones we used before.
Seeing the acceleration at the pre-training stage, we decided to take a similar approach to fine-tuning: we developed a fine-tuning procedure for the two-tower model that teaches it ranking while running the transformer only once.
Let’s say we have the user’s entire history for a year, as well as all the recommendations shown to the user within the same period. By running a transformer with a causal mask over the entire history, we get vector representations of the user for all the moments in that year at once, and so we can:
- Separately calculate the vectors of the shown items.
- Review the timestamps and map recommendation impressions to user vectors corresponding to the required lag in user history delivery.
- Calculate all the required scalar products and all terms of the loss function.
And all of this at once for the entire year—in a single transformer run.
Previously, we would rerun the transformer for each pair of impressions; now, we process all the impressions at once in a single run. This is a massive acceleration: by a factor of tens, hundreds, or even thousands. To employ a two-tower model like this, we can simply use the vector representation of the user at the last moment in time (corresponding to the last event in the history) as the current vector representation. For the items, we can use the encoder that was used during training for the impressions. In training, we simulate a one-day user history lag and then run the model as an offline model, recalculating user vectors daily.
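Here is a sketch of how this single-pass pairwise fine-tuning could be assembled. The timestamp-based lag mapping, the one-day lag value, and the pairwise logistic loss are illustrative assumptions consistent with the description above, not the exact production code.

```python
import torch
import torch.nn.functional as F

def pairwise_finetune_loss(user_states, user_ts, item_vecs_a, item_vecs_b,
                           pair_labels, pair_ts, lag_seconds=86400):
    """Single-pass pairwise ranking loss over all impressions (illustrative sketch).

    user_states:   [L, D] user vectors from one causal transformer pass, one per history event
    user_ts:       [L]    timestamps of those events (sorted ascending)
    item_vecs_a/b: [P, D] vectors of the two items in each impression pair
    pair_labels:   [P]    +1 if item a should be ranked above item b, else -1
    pair_ts:       [P]    timestamps of the impressions
    """
    # Map each impression to the latest user state at least `lag_seconds` old,
    # simulating the daily lag of an offline-deployed model.
    cutoff = pair_ts - lag_seconds
    idx = torch.searchsorted(user_ts, cutoff, right=True) - 1
    u = user_states[idx.clamp(min=0)]                                    # [P, D]
    score_diff = (u * item_vecs_a).sum(-1) - (u * item_vecs_b).sum(-1)   # [P]
    return F.softplus(-pair_labels * score_diff).mean()                  # pairwise logistic loss
```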
When I say that we process the user’s entire year of history in a single transformer run, I’m being somewhat misleading. In reality, we have a certain limit on the maximum history length that we enforce, and a user in a dataset can have multiple samples or chunks. For pre-training, these chunks don’t overlap.
However, during fine-tuning, there are limits not only on the maximum history length but also on its minimum length, as well as on the maximum number of recommendation impressions in a single training example used to train the model for ranking.
Results
We chose our music streaming service as the first one to experiment with. Recommendations are crucial here, and the service has a large number of active users. We’ve built a giant training dataset with over 300 billion listens from millions of users. This is tens or even hundreds of times larger than the training datasets we’d used before.
What’s a triple (context, item, feedback) in a music streaming service?
- Context: whether the current interaction is a recommendation or organic. If it’s a recommendation—what surface it’s on, and if it’s My Vibe—what the settings are.
- Item: a music track. The most important feature for item encoding is the item ID. We use unified embeddings to encode features with high cardinality; in this case, we take three 512K hashes per item. We use a fixed unified embedding matrix with 130 million parameters in our experiments (see the sketch after this list).
- User feedback: whether a track was liked, and what portion of the track was heard.
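For illustration, here is a minimal sketch of unified (hashed) embeddings under one possible reading of the setup above: a shared table of 512K buckets with three hash functions per item ID. The hashing scheme, table size, and names are assumptions.

```python
import torch
import torch.nn as nn

class HashedItemEmbedding(nn.Module):
    """Unified embeddings for high-cardinality IDs (illustrative sketch).
    Each item ID is hashed into a few buckets of a shared table; bucket vectors are summed."""
    def __init__(self, num_buckets: int = 512 * 1024, dim: int = 256, num_hashes: int = 3):
        super().__init__()
        self.table = nn.Embedding(num_buckets, dim)
        self.num_buckets = num_buckets
        # Fixed random odd multipliers act as cheap stand-in hash functions.
        self.register_buffer("salts", torch.randint(1, 2**31 - 1, (num_hashes,)) * 2 + 1)

    def forward(self, item_ids: torch.Tensor) -> torch.Tensor:  # item_ids: [B] int64 IDs
        buckets = (item_ids.unsqueeze(-1) * self.salts) % self.num_buckets  # [B, H]
        return self.table(buckets).sum(dim=1)                               # [B, D]
```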
For offline quality assessment, we use data from the week following the training period, via a global temporal split.
To assess the quality of the pre-trained model, we examine the loss function values in the pre-training tasks: next item prediction and feedback prediction. That is, we measure how well the model learned to solve the tasks we created for it. The smaller the value, the better.
Important: We consider the user’s history over a long period, but the loss function is only calculated for events that occur within the test period.
During fine-tuning, we learn to correctly rank item pairs based on user feedback, making PairAccuracy (the share of pairs correctly ordered by the model) a suitable offline metric for us. In practice, we weigh pairs based on the feedback contrast: for example, pairs in which the person liked one track and skipped the other have a higher weight than pairs in which the person listened to one track and skipped the other.
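A minimal sketch of such a weighted PairAccuracy metric (the names and the optional weighting are illustrative):

```python
import torch

def pair_accuracy(score_pref, score_other, weights=None):
    """Share of pairs ordered correctly by the model (illustrative sketch).

    score_pref:  model scores for the item the user preferred in each pair
    score_other: model scores for the other item of the pair
    weights:     optional per-pair weights (e.g. higher for like-vs-skip pairs)
    """
    correct = (score_pref > score_other).float()
    if weights is None:
        weights = torch.ones_like(correct)
    return (correct * weights).sum() / weights.sum()
```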
Our deployment scenario involves adding a powerful new feature to the final ranker. For this reason, we measure the relative increase in PairAccuracy for the final ranker with the new feature added, compared to the final ranker without it. The final ranker in our music streaming platform is gradient boosting.
A/B Test Results and Measurements
Our initial goal was to scale recommendation transformers. To test the scaling, we selected four different-sized transformer configurations, ranging from 3.2 million to 1.007 billion parameters.
We also decided to test the performance of the HSTU architecture. In “Actions Speak Louder than Words“, the authors proposed a new encoder architecture, which is quite different from the transformer architecture. Based on the authors’ experiments, this architecture outperforms transformers in recommendation tasks.
There’s scaling! Each new jump in architecture size results in a quality gain, both in pre-training and fine-tuning.
HSTU proved to be no better than transformers. We used the largest configuration mentioned by the authors in “Actions Speak Louder than Words.” It has one and a half times more parameters than our medium transformer, while having roughly the same quality.
If we visualize the metrics from the table as a graph, we can observe the scaling law for our four points: the dependence of quality on the logarithm of the number of parameters appears linear.
We performed a small ablation study to find out whether we could simplify our model or remove any parts from the training.
If you remove pre-training, the model’s quality drops.
If you reduce the duration of fine-tuning, the drop becomes even more pronounced.
At the beginning of this article, I mentioned that the authors of “Actions Speak Louder than Words” trained a model with a history length of 8,000 items. We decided to give it a try: it turns out that using such a deep musical history results in a noticeable improvement in recommendations. Previously, our models utilized a maximum of 1,500–2,000 events; this was the first time we had the opportunity to cross that threshold.
Implementation Results
We’ve been developing transformers for music recommendations for about three years now, and we’ve come a long way. Here’s what we’ve learned and how our transformer-based models have progressed over this time.
- Our first three transformers were all offline. User and item vectors were recalculated daily. Then, user vectors were loaded into a key-value store, and item vectors were stored in the service’s RAM, while only the dot product was calculated at runtime. We utilized some of these models not only for ranking, but also for candidate generation (we are familiar with building multi-head models that perform both tasks). In cases like this, the HNSW index, from which candidates can be retrieved, also resides in the service’s RAM.
- The first model only had a signal about likes, the second model had a signal about listens (including skips), and in the third model, we combined both signal types (explicit and implicit).
- The v4 model is an adaptation of v3 that runs at runtime with a slight lag in user history; its encoder is 6x smaller than that of the v3 model.
- The new ARGUS model has eight times the user history length and ten times the encoder size. It also uses a new learning process I described earlier.
TLT is the total listening time. The “like” likelihood represents the chances of a user liking a recommendation when it’s shown to them. Each implementation resulted in a metrics boost for our user-tailored recommendations. And the first ARGUS gave about the same increase in metrics as all the previous implementations combined!
My Vibe also has a special setting for which we use a separate ranking stack: Unfamiliar. We had a separate ARGUS implementation for this setting, achieving a 12% increase in total listening time and a 10% growth in like likelihood. The Unfamiliar setting is used by people who are interested in discovering new music. The fact that we saw a significant increase in this category confirms that ARGUS is more effective at handling non-trivial scenarios.
We implemented ARGUS in music scenarios on smart devices and successfully increased the total time users spend with an active speaker by 0.75%. Here, the final ranker is not a gradient boosting model, but a full-scale ranking neural network. Because of this, we were able to not only feed a single scalar feature from ARGUS but also pass full user and item vectors as input to the final ranker. Compared to a single scalar feature, this increased the quality gain by another one and a half to two times.
ARGUS has already been implemented not only as a ranking feature but also for candidate generation: the team has adapted the offline ARGUS into a runtime version. These implementations yielded significant gains in key metrics. Neural networks are the future of recommender systems, but there’s still a long journey ahead.
Thank you for reading.