Out of matrix non-negative matrix factorisation

I have written a bunch of posts on this blog about non-negative matrix factorisation (NNMF). However, all of them involved the test user to be a part of the matrix that we factorise to learn the latent factors. Is that always the case? Read on to find more!

Standard Problem

Our goal is given a matrix A, decompose it into two non-negative factors, as follows:

$ A_{M \times N} \approx W_{M \times K} \times H_{K \times N} $, such that $ W_{M \times K} \ge 0$ and $ H_{K \times N} \ge 0$

NMF Problem

Our Problem- Out of matrix factorisation

Imagine that we have trained the model for M-1 users on N movies. Now, the $M^{th}$ user has rated some movies. Do we retrain the model from scratch to consider $M^{th}$ user? This can be a very expensive operation!

NMF Problem

Instead, as shown in above figure, we will learn the user factor for the $M^{th}$ user. We can do this the shared movie factor (H) has already been learnt.

We can formulate as follows:

$$ A[M,:] = W[M,:]H $$

Taking transpose both sides

$$ A[M,:]^T = H^T W[M,:]^T $$

However, $A[M,:]^T$ will have missing entries. Thus, we can mask those entries from the calculation as shown below.

NMF Problem

Thus, we can write

$$ W[M,:]^T = \mathrm{Least Squares} (H^T[Mask], A[M,:]^T[Mask]) $$

If instead we want the factors to be non-negative, we can use non-negative least squares instead of usual least squares for this estimation.

Code example

I'll now present a simple code example to illustrate the procedure.

Defining matrix A

In [1]:
import numpy as np
import pandas as pd

M, N = 20, 10

np.random.seed(0)
A_orig = np.abs(np.random.uniform(low=0.0, high=1.0, size=(M,N)))
pd.DataFrame(A_orig).head()
Out[1]:
0 1 2 3 4 5 6 7 8 9
0 0.548814 0.715189 0.602763 0.544883 0.423655 0.645894 0.437587 0.891773 0.963663 0.383442
1 0.791725 0.528895 0.568045 0.925597 0.071036 0.087129 0.020218 0.832620 0.778157 0.870012
2 0.978618 0.799159 0.461479 0.780529 0.118274 0.639921 0.143353 0.944669 0.521848 0.414662
3 0.264556 0.774234 0.456150 0.568434 0.018790 0.617635 0.612096 0.616934 0.943748 0.681820
4 0.359508 0.437032 0.697631 0.060225 0.666767 0.670638 0.210383 0.128926 0.315428 0.363711

Masking a few entries

In [2]:
A = A_orig.copy()
A[0, 0] = np.NAN
A[3, 1] = np.NAN
A[6, 3] = np.NAN

# Masking for last user. 
A[19, 2] = np.NAN
A[19, 7] = np.NAN

We will be using A2 (first 19 users) matrix for learning the movie factors and the user factors for the 19 users

In [3]:
A2 = A[:-1,:]
A2.shape
Out[3]:
(19, 10)
In [4]:
A_df = pd.DataFrame(A)
A_df.head()
Out[4]:
0 1 2 3 4 5 6 7 8 9
0 NaN 0.715189 0.602763 0.544883 0.423655 0.645894 0.437587 0.891773 0.963663 0.383442
1 0.791725 0.528895 0.568045 0.925597 0.071036 0.087129 0.020218 0.832620 0.778157 0.870012
2 0.978618 0.799159 0.461479 0.780529 0.118274 0.639921 0.143353 0.944669 0.521848 0.414662
3 0.264556 NaN 0.456150 0.568434 0.018790 0.617635 0.612096 0.616934 0.943748 0.681820
4 0.359508 0.437032 0.697631 0.060225 0.666767 0.670638 0.210383 0.128926 0.315428 0.363711

Defining matrices W and H (learning on M-1 users and N movies)

In [5]:
K = 4
W = np.abs(np.random.uniform(low=0, high=1, size=(M-1, K)))
H = np.abs(np.random.uniform(low=0, high=1, size=(K, N)))
W = np.divide(W, K*W.max())
H = np.divide(H, K*H.max())
In [6]:
pd.DataFrame(W).head()
Out[6]:
0 1 2 3
0 0.078709 0.175784 0.095359 0.045339
1 0.006230 0.016976 0.171505 0.114531
2 0.135453 0.226355 0.250000 0.054753
3 0.167387 0.066473 0.005213 0.191444
4 0.080785 0.096801 0.148514 0.209789
In [7]:
pd.DataFrame(H).head()
Out[7]:
0 1 2 3 4 5 6 7 8 9
0 0.239700 0.203498 0.160529 0.222617 0.074611 0.216164 0.157328 0.003370 0.088415 0.037721
1 0.250000 0.121806 0.126649 0.162827 0.093851 0.034858 0.209333 0.048340 0.130195 0.057117
2 0.024914 0.219537 0.247731 0.244654 0.230833 0.197093 0.084828 0.020651 0.103694 0.059133
3 0.033735 0.013604 0.184756 0.002910 0.196210 0.037417 0.020248 0.022815 0.171121 0.062477

Defining the cost that we want to minimise

In [8]:
def cost(A, W, H):
    from numpy import linalg
    WH = np.dot(W, H)
    A_WH = A-WH
    return linalg.norm(A_WH, 'fro')

However, since A has missing entries, we have to define the cost in terms of the entries present in A

In [9]:
def cost(A, W, H):
    from numpy import linalg
    mask = pd.DataFrame(A).notnull().values
    WH = np.dot(W, H)
    WH_mask = WH[mask]
    A_mask = A[mask]
    A_WH_mask = A_mask-WH_mask
    # Since now A_WH_mask is a vector, we use L2 instead of Frobenius norm for matrix
    return linalg.norm(A_WH_mask, 2)

Let us just try to see the cost of the initial set of values of W and H we randomly assigned. Notice, we pass A2!

In [10]:
cost(A2, W, H)
Out[10]:
7.2333001567031294

Alternating NNLS procedure

In [11]:
num_iter = 1000
num_display_cost = max(int(num_iter/10), 1)
from scipy.optimize import nnls

for i in range(num_iter):
    if i%2 ==0:
        # Learn H, given A and W
        for j in range(N):
            mask_rows = pd.Series(A2[:,j]).notnull()
            H[:,j] = nnls(W[mask_rows], A2[:,j][mask_rows])[0]
    else:
        for j in range(M-1):
            mask_rows = pd.Series(A2[j,:]).notnull()
            W[j,:] = nnls(H.transpose()[mask_rows], A2[j,:][mask_rows])[0]
    WH = np.dot(W, H)
    c = cost(A2, W, H)
    if i%num_display_cost==0:
        print i, c
        
0 3.74162948918
100 2.25416363991
200 2.25258698617
300 2.25229707846
400 2.25131714233
500 2.24968386447
600 2.24967129897
700 2.24965023589
800 2.24961410381
900 2.24955008837
In [12]:
A_pred = pd.DataFrame(np.dot(W, H))
A_pred.head()
Out[12]:
0 1 2 3 4 5 6 7 8 9
0 0.590301 0.653038 0.531940 0.623272 0.584763 0.630835 0.574041 0.700139 0.841706 0.565808
1 0.802724 0.532299 0.482430 1.017968 0.149923 0.449312 0.097775 0.708727 0.506595 0.846219
2 0.764296 0.563711 0.527292 0.905236 0.306275 0.505674 0.223192 0.705882 0.604356 0.757878
3 0.373539 0.745239 0.334948 0.663219 0.132686 0.551844 0.760420 0.598546 0.808108 0.627732
4 0.467623 0.331457 0.617263 0.239449 0.634455 0.370041 0.294412 0.288539 0.484822 0.126945

Learning home factors for $M^{th}$ home

In [13]:
A_m = A[-1,:]
A_m_transpose = A_m.T
mask = ~np.isnan(A_m_transpose)
W_m = nnls(H.T[mask], A_m_transpose[mask])[0]
In [14]:
W_m
Out[14]:
array([ 0.12248095,  0.20778687,  0.15185613,  0.        ])

Predicting for $M^{th}$ home

In [15]:
ratings_m_home = np.dot(H.T, W_m)
In [16]:
ratings_m_home[~mask]
Out[16]:
array([ 0.4245947 ,  0.57447552])
In [17]:
A_orig[-1,:][~mask]
Out[17]:
array([ 0.18619301,  0.25435648])

There you go, we are able to get ratings for the $M^{th}$ user for the movies that they have not seen. We only trained the model on the other users! Ofcourse, these numbers might not look so impressive. However, this was just a toy example based on random data. In reality, we could expect better results!

>>