Recommend movies for users with TensorFlow Ranking

View on TensorFlow.org Run in Google Colab View source on GitHub Download notebook

In this tutorial, we build a simple two tower ranking model using the MovieLens 100K dataset with TF-Ranking. We can use this model to rank and recommend movies for a given user according to their predicted user ratings.

Setup

Install and import the TF-Ranking library:

pipinstall-qtensorflow-ranking
pipinstall-q--upgradetensorflow-datasets
fromtypingimport Dict, Tuple
importtensorflowastf
importtensorflow_datasetsastfds
importtensorflow_rankingastfr
2024年03月19日 11:34:49.704174: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024年03月19日 11:34:49.704225: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024年03月19日 11:34:49.705795: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered

Read the data

Prepare to train a model by creating a ratings dataset and movies dataset. Use user_id as the query input feature, movie_title as the document input feature, and user_rating as the label to train the ranking model.

%%capture --no-display
# Ratings data.
ratings = tfds.load('movielens/100k-ratings', split="train")
# Features of all the available movies.
movies = tfds.load('movielens/100k-movies', split="train")
# Select the basic features.
ratings = ratings.map(lambda x: {
 "movie_title": x["movie_title"],
 "user_id": x["user_id"],
 "user_rating": x["user_rating"]
})
2024年03月19日 11:34:53.385017: E external/local_xla/xla/stream_executor/cuda/cuda_driver.cc:274] failed call to cuInit: CUDA_ERROR_NO_DEVICE: no CUDA-capable device is detected

Build vocabularies to convert all user ids and all movie titles into integer indices for embedding layers:

movies = movies.map(lambda x: x["movie_title"])
users = ratings.map(lambda x: x["user_id"])
user_ids_vocabulary = tf.keras.layers.experimental.preprocessing.StringLookup(
 mask_token=None)
user_ids_vocabulary.adapt(users.batch(1000))
movie_titles_vocabulary = tf.keras.layers.experimental.preprocessing.StringLookup(
 mask_token=None)
movie_titles_vocabulary.adapt(movies.batch(1000))

Group by user_id to form lists for ranking models:

key_func = lambda x: user_ids_vocabulary(x["user_id"])
reduce_func = lambda key, dataset: dataset.batch(100)
ds_train = ratings.group_by_window(
 key_func=key_func, reduce_func=reduce_func, window_size=100)
for x in ds_train.take(1):
 for key, value in x.items():
 print(f"Shape of {key}: {value.shape}")
 print(f"Example values of {key}: {value[:5].numpy()}")
 print()
Shape of movie_title: (100,)
Example values of movie_title: [b'Man Who Would Be King, The (1975)' b'Silence of the Lambs, The (1991)'
 b'Next Karate Kid, The (1994)' b'2001: A Space Odyssey (1968)'
 b'Usual Suspects, The (1995)']
Shape of user_id: (100,)
Example values of user_id: [b'405' b'405' b'405' b'405' b'405']
Shape of user_rating: (100,)
Example values of user_rating: [1. 4. 1. 5. 5.]

Generate batched features and labels:

def_features_and_labels(
 x: Dict[str, tf.Tensor]) -> Tuple[Dict[str, tf.Tensor], tf.Tensor]:
 labels = x.pop("user_rating")
 return x, labels
ds_train = ds_train.map(_features_and_labels)
ds_train = ds_train.apply(
 tf.data.experimental.dense_to_ragged_batch(batch_size=32))
WARNING:tensorflow:From /tmpfs/tmp/ipykernel_12750/4021484596.py:10: dense_to_ragged_batch (from tensorflow.python.data.experimental.ops.batching) is deprecated and will be removed in a future version.
Instructions for updating:
Use `tf.data.Dataset.ragged_batch` instead.

The user_id and movie_title tensors generated in ds_train are of shape [32, None], where the second dimension is 100 in most cases except for the batches when less than 100 items grouped in lists. A model working on ragged tensors is thus used.

for x, label in ds_train.take(1):
 for key, value in x.items():
 print(f"Shape of {key}: {value.shape}")
 print(f"Example values of {key}: {value[:3,:3].numpy()}")
 print()
 print(f"Shape of label: {label.shape}")
 print(f"Example values of label: {label[:3,:3].numpy()}")
Shape of movie_title: (32, None)
Example values of movie_title: [[b'Man Who Would Be King, The (1975)'
 b'Silence of the Lambs, The (1991)' b'Next Karate Kid, The (1994)']
 [b'Flower of My Secret, The (Flor de mi secreto, La) (1995)'
 b'Little Princess, The (1939)' b'Time to Kill, A (1996)']
 [b'Kundun (1997)' b'Scream (1996)' b'Power 98 (1995)']]
Shape of user_id: (32, None)
Example values of user_id: [[b'405' b'405' b'405']
 [b'655' b'655' b'655']
 [b'13' b'13' b'13']]
Shape of label: (32, None)
Example values of label: [[1. 4. 1.]
 [3. 3. 3.]
 [5. 1. 1.]]

Define a model

Define a ranking model by inheriting from tf.keras.Model and implementing the call method:

classMovieLensRankingModel(tf.keras.Model):
 def__init__(self, user_vocab, movie_vocab):
 super().__init__()
 # Set up user and movie vocabulary and embedding.
 self.user_vocab = user_vocab
 self.movie_vocab = movie_vocab
 self.user_embed = tf.keras.layers.Embedding(user_vocab.vocabulary_size(),
 64)
 self.movie_embed = tf.keras.layers.Embedding(movie_vocab.vocabulary_size(),
 64)
 defcall(self, features: Dict[str, tf.Tensor]) -> tf.Tensor:
 # Define how the ranking scores are computed: 
 # Take the dot-product of the user embeddings with the movie embeddings.
 user_embeddings = self.user_embed(self.user_vocab(features["user_id"]))
 movie_embeddings = self.movie_embed(
 self.movie_vocab(features["movie_title"]))
 return tf.reduce_sum(user_embeddings * movie_embeddings, axis=2)

Create the model, and then compile it with ranking tfr.keras.losses and tfr.keras.metrics, which are the core of the TF-Ranking package.

This example uses a ranking-specific softmax loss, which is a listwise loss introduced to promote all relevant items in the ranking list with better chances on top of the irrelevant ones. In contrast to the softmax loss in the multi-class classification problem, where only one class is positive and the rest are negative, the TF-Ranking library supports multiple relevant documents in a query list and non-binary relevance labels.

For ranking metrics, this example uses in specific Normalized Discounted Cumulative Gain (NDCG) and Mean Reciprocal Rank (MRR), which calculate the user utility of a ranked query list with position discounts. For more details about ranking metrics, review evaluation measures offline metrics.

# Create the ranking model, trained with a ranking loss and evaluated with
# ranking metrics.
model = MovieLensRankingModel(user_ids_vocabulary, movie_titles_vocabulary)
optimizer = tf.keras.optimizers.Adagrad(0.5)
loss = tfr.keras.losses.get(
 loss=tfr.keras.losses.RankingLossKey.SOFTMAX_LOSS, ragged=True)
eval_metrics = [
 tfr.keras.metrics.get(key="ndcg", name="metric/ndcg", ragged=True),
 tfr.keras.metrics.get(key="mrr", name="metric/mrr", ragged=True)
]
model.compile(optimizer=optimizer, loss=loss, metrics=eval_metrics)

Train and evaluate the model

Train the model with model.fit.

model.fit(ds_train, epochs=3)
Epoch 1/3
48/48 [==============================] - 7s 56ms/step - loss: 998.7637 - metric/ndcg: 0.8213 - metric/mrr: 1.0000
Epoch 2/3
48/48 [==============================] - 4s 53ms/step - loss: 997.1824 - metric/ndcg: 0.9161 - metric/mrr: 1.0000
Epoch 3/3
48/48 [==============================] - 4s 53ms/step - loss: 994.8384 - metric/ndcg: 0.9383 - metric/mrr: 1.0000
<keras.src.callbacks.History at 0x7f666424d700>

Generate predictions and evaluate.

# Get movie title candidate list.
for movie_titles in movies.batch(2000):
 break
# Generate the input for user 42.
inputs = {
 "user_id":
 tf.expand_dims(tf.repeat("42", repeats=movie_titles.shape[0]), axis=0),
 "movie_title":
 tf.expand_dims(movie_titles, axis=0)
}
# Get movie recommendations for user 42.
scores = model(inputs)
titles = tfr.utils.sort_by_scores(scores,
 [tf.expand_dims(movie_titles, axis=0)])[0]
print(f"Top 5 recommendations for user 42: {titles[0,:5]}")
Top 5 recommendations for user 42: [b'Star Wars (1977)' b'Liar Liar (1997)' b'Toy Story (1995)'
 b'Raiders of the Lost Ark (1981)' b'Sound of Music, The (1965)']

Except as otherwise noted, the content of this page is licensed under the Creative Commons Attribution 4.0 License, and code samples are licensed under the Apache 2.0 License. For details, see the Google Developers Site Policies. Java is a registered trademark of Oracle and/or its affiliates.

Last updated 2024年03月19日 UTC.