Recommender Systems in Keras

A programming introduction to recommender systems using Keras!
ML
Author

Nipun Batra

Published

December 18, 2017

I have written a few posts earlier about matrix factorisation using various Python libraries. The main application I had in mind for matrix factorisation was recommender systems. In this post, I'll write about using Keras for creating recommender systems. Various people have written excellent similar posts and code that I draw a lot of inspiration from, so credit to them! I'm assuming that the reader has some experience with Keras, as this post is not intended to be an introduction to Keras.

Specifically, in this post, I’ll talk about:

  1. Matrix Factorisation in Keras
  2. Adding non-negativity constraints to solve non-negative matrix factorisation (NNMF)
  3. Using neural networks for recommendations

I'll be using the MovieLens-100k dataset for illustration. There are 943 users and 1682 movies, and 100k ratings in total. It should be noted that the maximum possible number of ratings for the <user, movie> pairs would be 943 × 1682 ≈ 1.59M, which means that we have only about 6% of all possible ratings! All ratings are on a scale of 1-5.
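
As a quick sanity check (this little snippet is mine, not from the original post), the sparsity works out as follows:

# Fraction of the possible <user, movie> pairs that actually have a rating
n_possible = 943 * 1682
print(100000 / n_possible)  # ~0.063, i.e. roughly 6% of all possible pairs are rated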

Task

Given this set of ratings, can we recommend the next set of movies to a user? This would translate to: for every user, estimating the ratings for all the movies that (s)he hasn't watched, and maybe recommending the top-k movies by the estimated ratings!

Peek into the dataset

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings('ignore')

%matplotlib inline
dataset = pd.read_csv("/Users/nipun/Downloads/ml-100k/u.data",sep='\t',names="user_id,item_id,rating,timestamp".split(","))
dataset.head()
user_id item_id rating timestamp
0 196 242 3 881250949
1 186 302 3 891717742
2 22 377 1 878887116
3 244 51 2 880606923
4 166 346 1 886397596

So, each record (row) shows the rating for a user, item (movie) pair. It should be noted that I use item and movie interchangeably in this post.

len(dataset.user_id.unique()), len(dataset.item_id.unique())
(943, 1682)

We assign each user a unique number in the range [0, #users), and do the same for movies.

dataset.user_id = dataset.user_id.astype('category').cat.codes.values
dataset.item_id = dataset.item_id.astype('category').cat.codes.values
dataset.head()
user_id item_id rating timestamp
0 195 241 3 881250949
1 185 301 3 891717742
2 21 376 1 878887116
3 243 50 2 880606923
4 165 345 1 886397596

Train test split

We’ll now split our dataset of 100k ratings into train (containing 80k ratings) and test (containing 20k ratings). Given the train set, we’d like to accurately estimate the ratings in the test set.

from sklearn.model_selection import train_test_split
train, test = train_test_split(dataset, test_size=0.2)
train.head()
user_id item_id rating timestamp
90092 832 12 2 875036139
50879 94 132 3 888954341
67994 436 12 4 880141129
49769 710 344 4 884485683
11032 121 736 4 879270874
test.head()
user_id item_id rating timestamp
89284 907 493 3 879723046
60499 550 25 4 892785056
11090 373 222 5 880394520
36096 199 140 4 884129346
21633 71 317 5 880037702

Matrix factorisation

One popular approach to recommender systems is called Matrix Factorisation. It works on the principle that we can learn a low-dimensional representation (embedding) of users and movies. For example, for each movie, we can have how much action it has, how long it is, and so on. For each user, we can encode how much they like action, or how much they like long movies, etc. Thus, we can combine the user and the movie embeddings to estimate the ratings on unseen movies. This approach can also be viewed as: given a matrix (A [M x N]) containing users and movies, we want to estimate low-dimensional matrices (W [M x k] and H [N x k]), such that: \(A \approx W.H^T\)
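
To make the factorisation concrete, here is a tiny numpy sketch (mine, not part of the original post) of what \(A \approx W.H^T\) means: every estimated rating is simply the dot product of one user row of W with one movie row of H.

import numpy as np

# Toy example (made-up numbers): 4 users, 5 movies, k = 2 latent factors
W = np.random.rand(4, 2)   # user embeddings, shape (M, k)
H = np.random.rand(5, 2)   # movie embeddings, shape (N, k)

A_hat = W @ H.T             # reconstructed ratings matrix, shape (M, N)
# The estimated rating of user 0 for movie 3 is just a dot product:
print(np.allclose(A_hat[0, 3], W[0] @ H[3]))   # True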

Matrix factorisation in Keras

We’ll now write some code to solve the recommendation problem by matrix factorisation in Keras. We’re trying to learn two low-dimensional embeddings of users and items.

import keras
from IPython.display import SVG
from keras.optimizers import Adam
from keras.utils.vis_utils import model_to_dot
n_users, n_movies = len(dataset.user_id.unique()), len(dataset.item_id.unique())
n_latent_factors = 3
Using TensorFlow backend.

The key thing is to learn an embedding for movies and users, and then combine them using the dot product! For estimating the rating, for each user, movie pair of interest, we’d take the dot product of the respective user and item embedding. As an example, if we have 2 dimensions in our user and item embedding, which say correspond to [how much user likes action, how much user likes long movies], and the item embedding is [how much action is in the movie, how long is the movie]. Then, we can predict for a user u, and movie m as how much u likes action \(\times\) how much action is there in m \(+\) how much u likes long movies \(\times\) how long is m.

Our model would optimise the embeddings such that we minimise the mean squared error on the ratings from the train set.

movie_input = keras.layers.Input(shape=[1],name='Item')
movie_embedding = keras.layers.Embedding(n_movies + 1, n_latent_factors, name='Movie-Embedding')(movie_input)
movie_vec = keras.layers.Flatten(name='FlattenMovies')(movie_embedding)

user_input = keras.layers.Input(shape=[1],name='User')
user_vec = keras.layers.Flatten(name='FlattenUsers')(keras.layers.Embedding(n_users + 1, n_latent_factors,name='User-Embedding')(user_input))

prod = keras.layers.merge([movie_vec, user_vec], mode='dot',name='DotProduct')
model = keras.Model([user_input, movie_input], prod)
model.compile('adam', 'mean_squared_error')
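
As an aside (my note, not part of the original post): keras.layers.merge with a mode argument was deprecated in Keras 2, so on newer versions the dot-product merge would look roughly like this instead:

# A sketch for newer Keras 2.x versions, where merge(mode='dot') is replaced by the Dot layer
prod = keras.layers.dot([movie_vec, user_vec], axes=1, name='DotProduct')
model = keras.Model([user_input, movie_input], prod)
model.compile('adam', 'mean_squared_error')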

Here’s a visualisation of our model for a better understanding.

SVG(model_to_dot(model,  show_shapes=True, show_layer_names=True, rankdir='HB').create(prog='dot', format='svg'))

We can see that in the Merge layer, we take the dot product of the user and the item embeddings to obtain the rating.

We can also summarise our model as follows:

model.summary()
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
==================================================================================================
Item (InputLayer)               (None, 1)            0                                            
__________________________________________________________________________________________________
User (InputLayer)               (None, 1)            0                                            
__________________________________________________________________________________________________
Movie-Embedding (Embedding)     (None, 1, 3)         5049        Item[0][0]                       
__________________________________________________________________________________________________
User-Embedding (Embedding)      (None, 1, 3)         2832        User[0][0]                       
__________________________________________________________________________________________________
FlattenMovies (Flatten)         (None, 3)            0           Movie-Embedding[0][0]            
__________________________________________________________________________________________________
FlattenUsers (Flatten)          (None, 3)            0           User-Embedding[0][0]             
__________________________________________________________________________________________________
DotProduct (Merge)              (None, 1)            0           FlattenMovies[0][0]              
                                                                 FlattenUsers[0][0]               
==================================================================================================
Total params: 7,881
Trainable params: 7,881
Non-trainable params: 0
__________________________________________________________________________________________________

So, we have 7,881 parameters to learn: (1682 + 1) × 3 = 5,049 for the movie embedding and (943 + 1) × 3 = 2,832 for the user embedding. Let's train our model now!

history = model.fit([train.user_id, train.item_id], train.rating, epochs=100, verbose=0)

Train error v/s epoch number

Before we test how well our model does in the test setting, we can visualise the train loss with epoch number.

pd.Series(history.history['loss']).plot(logy=True)
plt.xlabel("Epoch")
plt.ylabel("Train Error")

Prediction error

Let's now see how our model does! I'll do a small post-processing step to round off our predictions to the nearest integer. This is usually not done; it's just a whimsical step here, since the training ratings are all integers! There are better ways to encode this integer requirement (one-hot encoding!), but we won't discuss them in this post.

y_hat = np.round(model.predict([test.user_id, test.item_id]),0)
y_true = test.rating
from sklearn.metrics import mean_absolute_error
mean_absolute_error(y_true, y_hat)
0.6915

Not bad! We're able to get an \(MAE\) of 0.69! I'm sure that with a bit of parameter/hyper-parameter optimisation, we may be able to improve the results. However, I won't talk about these optimisations in this post.
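
Coming back to the original task of recommending movies: here is a minimal sketch (mine, not from the original post) of how one might use the trained model to pick the top-k unseen movies for a single user. The user id and k are just illustrative choices.

# Recommend the top-k unseen movies for one user (illustrative sketch)
user_of_interest = 0   # any user id after the category-code mapping
already_rated = set(dataset[dataset.user_id == user_of_interest].item_id)
candidates = np.array([m for m in range(n_movies) if m not in already_rated])

# Estimate a rating for every movie this user hasn't rated yet
predicted = model.predict([np.full(len(candidates), user_of_interest), candidates]).ravel()

# Take the k movies with the highest estimated ratings
k = 10
top_k_movies = candidates[np.argsort(predicted)[::-1][:k]]
print(top_k_movies)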

Extracting the learnt embeddings

We can extract the learnt movie and user embeddings as follows:

movie_embedding_learnt = model.get_layer(name='Movie-Embedding').get_weights()[0]
pd.DataFrame(movie_embedding_learnt).describe()
0 1 2
count 1683.000000 1683.000000 1683.000000
mean -0.935420 0.857862 0.954169
std 0.517458 0.447439 0.458095
min -2.524487 -0.459752 -0.989537
25% -1.323431 0.546364 0.642444
50% -0.949188 0.851243 0.993619
75% -0.550862 1.159588 1.283555
max 0.500618 2.140607 2.683658
user_embedding_learnt = model.get_layer(name='User-Embedding').get_weights()[0]
pd.DataFrame(user_embedding_learnt).describe()
0 1 2
count 944.000000 944.000000 944.000000
mean -1.126231 1.171609 1.109131
std 0.517478 0.409016 0.548384
min -2.883226 -0.500010 -0.415373
25% -1.458197 0.903574 0.735729
50% -1.159480 1.199517 1.084089
75% -0.836746 1.456610 1.468611
max 0.899436 2.605330 2.826109

We can see that both the user and the item embeddings have negative elements. There are some applications which require that the learnt embeddings be non-negative. This approach is also called non-negative matrix factorisation, which we'll work out now.

Non-negative Matrix factorisation (NNMF) in Keras

The code for NNMF remains exactly the same as the code for matrix factorisation. The only change is that we add non-negativity constraints on the learnt embeddings. This is done as follows:

from keras.constraints import non_neg
movie_input = keras.layers.Input(shape=[1],name='Item')
movie_embedding = keras.layers.Embedding(n_movies + 1, n_latent_factors, name='NonNegMovie-Embedding', embeddings_constraint=non_neg())(movie_input)
movie_vec = keras.layers.Flatten(name='FlattenMovies')(movie_embedding)

user_input = keras.layers.Input(shape=[1],name='User')
user_vec = keras.layers.Flatten(name='FlattenUsers')(keras.layers.Embedding(n_users + 1, n_latent_factors,name='NonNegUser-Embedding',embeddings_constraint=non_neg())(user_input))

prod = keras.layers.merge([movie_vec, user_vec], mode='dot',name='DotProduct')
model = keras.Model([user_input, movie_input], prod)
model.compile('adam', 'mean_squared_error')

We now verify if we are indeed able to learn non-negative embeddings. I’ll not compare the performance of NNMF on the test set, in the interest of space.

history_nonneg = model.fit([train.user_id, train.item_id], train.rating, epochs=10, verbose=0)
movie_embedding_learnt = model.get_layer(name='NonNegMovie-Embedding').get_weights()[0]
pd.DataFrame(movie_embedding_learnt).describe()
0 1 2
count 1683.000000 1683.000000 1683.000000
mean 0.838450 0.840330 0.838066
std 0.301618 0.301529 0.301040
min -0.000000 -0.000000 -0.000000
25% 0.657749 0.663951 0.656453
50% 0.901495 0.904192 0.895934
75% 1.072706 1.073591 1.072926
max 1.365719 1.379006 1.373672

Looks good!

Neural networks for recommendation

We'll now create a simple neural network for recommendation, or for estimating ratings! This model is very similar to the earlier matrix factorisation models, but differs in the following ways:

  1. Instead of taking a dot product of the user and the item embedding, we concatenate them and use them as features for our neural network. Thus, we are not constrained to the dot product way of combining the embeddings, and can learn complex non-linear relationships.
  2. Due to #1, we can now use different embedding dimensions for users and items. This can be useful if one of the two needs a richer representation than the other.
n_latent_factors_user = 5
n_latent_factors_movie = 8

movie_input = keras.layers.Input(shape=[1],name='Item')
movie_embedding = keras.layers.Embedding(n_movies + 1, n_latent_factors_movie, name='Movie-Embedding')(movie_input)
movie_vec = keras.layers.Flatten(name='FlattenMovies')(movie_embedding)
movie_vec = keras.layers.Dropout(0.2)(movie_vec)


user_input = keras.layers.Input(shape=[1],name='User')
user_vec = keras.layers.Flatten(name='FlattenUsers')(keras.layers.Embedding(n_users + 1, n_latent_factors_user,name='User-Embedding')(user_input))
user_vec = keras.layers.Dropout(0.2)(user_vec)


# Concatenate the (dropout-regularised) user and movie embeddings and
# stack a few fully connected layers on top
concat = keras.layers.merge([movie_vec, user_vec], mode='concat',name='Concat')
dense_1 = keras.layers.Dense(100, name='FullyConnected-1')(concat)
dense_2 = keras.layers.Dense(50, name='FullyConnected-2')(dense_1)
dense_3 = keras.layers.Dense(20, name='FullyConnected-3', activation='relu')(dense_2)


result = keras.layers.Dense(1, activation='relu', name='Activation')(dense_3)
adam = Adam(lr=0.005)
model = keras.Model([user_input, movie_input], result)
model.compile(optimizer=adam, loss='mean_absolute_error')

Let's now see what our model looks like:

SVG(model_to_dot(model,  show_shapes=True, show_layer_names=True, rankdir='HB').create(prog='dot', format='svg'))

It should be noted that we use a different number of embedding dimensions for users (5) and items (8)! These combine to form a vector of length 5 + 8 = 13, which is then fed into the neural network. We also add dropout layers to prevent overfitting!

model.summary()
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
==================================================================================================
Item (InputLayer)               (None, 1)            0                                            
__________________________________________________________________________________________________
User (InputLayer)               (None, 1)            0                                            
__________________________________________________________________________________________________
Movie-Embedding (Embedding)     (None, 1, 8)         13464       Item[0][0]                       
__________________________________________________________________________________________________
User-Embedding (Embedding)      (None, 1, 5)         4720        User[0][0]                       
__________________________________________________________________________________________________
FlattenMovies (Flatten)         (None, 8)            0           Movie-Embedding[0][0]            
__________________________________________________________________________________________________
FlattenUsers (Flatten)          (None, 5)            0           User-Embedding[0][0]             
__________________________________________________________________________________________________
dropout_1 (Dropout)             (None, 8)            0           FlattenMovies[0][0]              
__________________________________________________________________________________________________
dropout_2 (Dropout)             (None, 5)            0           FlattenUsers[0][0]               
__________________________________________________________________________________________________
Concat (Merge)                  (None, 13)           0           dropout_1[0][0]                  
                                                                 dropout_2[0][0]                  
__________________________________________________________________________________________________
FullyConnected-1 (Dense)        (None, 100)          1400        Concat[0][0]                     
__________________________________________________________________________________________________
FullyConnected-2 (Dense)        (None, 50)           5050        FullyConnected-1[0][0]           
__________________________________________________________________________________________________
FullyConnected-3 (Dense)        (None, 20)           1020        FullyConnected-2[0][0]           
__________________________________________________________________________________________________
Activation (Dense)              (None, 1)            21          FullyConnected-3[0][0]           
==================================================================================================
Total params: 25,675
Trainable params: 25,675
Non-trainable params: 0
__________________________________________________________________________________________________

We can see that the number of parameters is more than what we had in the Matrix Factorisation case. Let’s see how this model works. I’ll run it for more epochs given that we have more parameters.

history = model.fit([train.user_id, train.item_id], train.rating, epochs=250, verbose=0)

Prediction performance of Neural Network based recommender system

y_hat_2 = np.round(model.predict([test.user_id, test.item_id]),0)
print(mean_absolute_error(y_true, y_hat_2))

print(mean_absolute_error(y_true, model.predict([test.user_id, test.item_id])))
0.6957
0.708807692927

Pretty similar to the result we got using matrix factorisation. Maybe we need to tweak the neural network a lot more to get better results?
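
One hedged illustration of such tweaking (my sketch, not something evaluated in this post): hold out part of the training data as a validation set and stop training early when the validation loss stops improving.

from keras.callbacks import EarlyStopping

# Early stopping on a held-out validation split (illustrative sketch)
early_stop = EarlyStopping(monitor='val_loss', patience=5)
history_tuned = model.fit([train.user_id, train.item_id], train.rating,
                          validation_split=0.1,   # hold out 10% of the train set
                          epochs=250, verbose=0,
                          callbacks=[early_stop])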

Thanks for reading. This post has been a good learning experience for me. Hope you enjoyed it too!