Hop into Recommender Systems with Matrix Factorization

Deshiwa Budilaksana
3 min read · Nov 4, 2020

As artificial intelligence and machine learning evolve, demand for them keeps growing. Recommender systems, as one branch of machine learning, are an approachable way into this field. Matrix factorization is a commonly used approach to building them because it predicts well and is easy to learn.

This time, we will build a recommender that suggests movies to users, using the MovieLens 20M dataset. The code is written in Python in a Jupyter Notebook provided by Conda. Before making any recommendations, we need a few libraries: FindSpark, PySpark and Pandas. The first step is to import Spark and initialize a session.

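import findspark
findspark.init()  # make the local Spark installation importable as pyspark
from pyspark.sql import SparkSession
# Create (or reuse) the Spark session for this notebook
spark = SparkSession.builder.appName('movie_recommender').getOrCreate()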

Next, load the movie data with pandas and store it in a variable named “movie”. This file holds only movie information, with no ratings or users in it. Of its columns, the only ones needed are “movieId” and “title”.

import pandas as pd
movie = pd.read_csv("movie.csv")
# Keep only the columns needed for the recommender
movie = movie.loc[:, ["movieId", "title"]]
movie.head(10)
Movies List

With the movie data loaded, the next step is to load the users’ movie ratings in the same way. From this file, the columns needed are “userId”, “movieId” and “rating”.

rating = pd.read_csv("rating.csv")
print(rating.columns)
rating = rating.loc[:, ["userId", "movieId", "rating"]]
rating.head(10)
Movie Rating

The movie and rating data are then merged into a single table. Only the first 1 million of the 20 million rows in the dataset are used here.

data = pd.merge(movie, rating)
data = data.iloc[:1000000,:]
data.head(10)
Merged Data
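
To see what the merge does, here is a small sketch on made-up rows (the titles and ratings below are invented for illustration, not taken from MovieLens). It assumes that pd.merge(movie, rating) joins on the one column the two frames share, movieId, and uses an inner join, which is the pandas default; the sketch simply spells both out.

```python
import pandas as pd

# Tiny stand-ins for movie.csv and rating.csv (invented rows, illustration only)
movie = pd.DataFrame({"movieId": [1, 2, 3], "title": ["A", "B", "C"]})
rating = pd.DataFrame({
    "userId": [10, 11, 10, 12],
    "movieId": [2, 1, 1, 2],
    "rating": [4.0, 3.5, 5.0, 2.0],
})

# Equivalent to pd.merge(movie, rating), with the join key and join type made explicit
data = pd.merge(movie, rating, on="movieId", how="inner")

# Movie 3 has no ratings, so the inner join drops it
print(data)
```

An inner merge keeps the order of the left frame's keys, so the first million merged rows come from the movies listed first in movie.csv rather than from a random sample. If a random subset is preferred, something like data.sample(n=1000000, random_state=0) could be used instead; that is a suggestion, not what this article does.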

With the data ready, we can start building the recommender. We begin by importing the required classes from PySpark. This time, the recommendations will be produced with ALS (alternating least squares), and the results will be evaluated with mean squared error.

from pyspark.ml.evaluation import RegressionEvaluator
from pyspark.ml.recommendation import ALS
from pyspark.ml.feature import StringIndexer
from pyspark.ml import Pipeline
from pyspark.sql.functions import col

Because the data is currently a Pandas DataFrame, it needs to be converted into a Spark DataFrame first.

sparkdf = spark.createDataFrame(data)

Each column of the Spark DataFrame except “rating” is then given its own StringIndexer, and the indexers are combined into a Pipeline.

indexer = [StringIndexer(inputCol=column, outputCol=column + "_index")
           for column in list(set(sparkdf.columns) - set(['rating']))]
pipeline = Pipeline(stages=indexer)
transformed = pipeline.fit(sparkdf).transform(sparkdf)
transformed.show()
Spark DataFrame
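
For intuition about what the indexing step produces, here is a plain-Python sketch. It assumes StringIndexer's default ordering, where the most frequent value gets index 0.0, and breaks ties alphabetically; the helper name fit_string_index and the sample ids are hypothetical, and this is an imitation for illustration, not Spark's own code.

```python
from collections import Counter


def fit_string_index(values):
    """Map each distinct value to a float index, most frequent value first.

    Assumption: imitates StringIndexer's default 'frequencyDesc' ordering,
    with ties broken alphabetically. Values are compared as strings.
    """
    counts = Counter(str(v) for v in values)
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: float(i) for i, label in enumerate(ordered)}


# Invented movieId column: 296 appears three times, 1 twice, 318 once
movie_ids = [296, 1, 296, 318, 1, 296]
mapping = fit_string_index(movie_ids)
indexed = [mapping[str(v)] for v in movie_ids]

print(mapping)
print(indexed)
```

The result is a dense, zero-based index per column, which is what the "_index" columns in the transformed DataFrame contain.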

Now for the recommendation stage. First, split the data into 80% training data and 20% test data. The model is then defined with Spark’s ALS class and trained on the training data.

(train, test) = transformed.randomSplit([0.8, 0.2])
als = ALS(maxIter=5, regParam=0.09, rank=25, userCol="userId_index",
          itemCol="movieId_index", ratingCol="rating",
          coldStartStrategy="drop", nonnegative=True)
model = als.fit(train)
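
To give a feel for what ALS does under the hood, here is a rough NumPy sketch of alternating least squares on a tiny made-up rating matrix. It is not Spark's implementation: the function name als_sketch, the toy matrix and the parameter values are all assumptions for illustration, and it leaves out details such as the non-negativity constraint requested by nonnegative=True. The idea is to hold the movie factors fixed and solve a small regularized least-squares problem for each user, then swap roles, and repeat.

```python
import numpy as np


def als_sketch(R, mask, rank=2, reg=0.1, iters=20, seed=0):
    """Factor R into U @ V.T using only the entries where mask is True."""
    rng = np.random.default_rng(seed)
    n_users, n_items = R.shape
    U = rng.random((n_users, rank))   # user latent factors
    V = rng.random((n_items, rank))   # movie latent factors
    ridge = reg * np.eye(rank)        # regularization term added to each solve
    for _ in range(iters):
        # Fix V, solve a ridge regression for every user
        for u in range(n_users):
            seen = mask[u]
            if seen.any():
                Vs = V[seen]
                U[u] = np.linalg.solve(Vs.T @ Vs + ridge, Vs.T @ R[u, seen])
        # Fix U, solve a ridge regression for every movie
        for i in range(n_items):
            seen = mask[:, i]
            if seen.any():
                Us = U[seen]
                V[i] = np.linalg.solve(Us.T @ Us + ridge, Us.T @ R[seen, i])
    return U, V


# Invented ratings; 0 marks a rating that is missing
R = np.array([[5, 4, 0, 1],
              [4, 5, 1, 0],
              [1, 0, 5, 4],
              [0, 1, 4, 5]], dtype=float)
mask = R > 0

U, V = als_sketch(R, mask)
pred = U @ V.T  # predicted rating for every user/movie pair, including missing ones
train_mse = float(np.mean((R[mask] - pred[mask]) ** 2))
print(np.round(pred, 2))
print("MSE on known ratings:", round(train_mse, 4))
```

Each row of U is a user's latent vector and each row of V is a movie's; their dot product is the predicted rating. In the Spark model above, rank=25 sets the length of those vectors.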

Then predict ratings for the test data prepared earlier.

predictions = model.transform(test)
predictions.show()
Predictions Result
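
The mean squared error mentioned earlier can be sketched in a few lines. The helper name mean_squared_error and the sample pairs below are made up for illustration. In Spark, the same number should be obtainable from the RegressionEvaluator imported above, as shown in the comment, assuming the prediction column keeps its default name "prediction".

```python
# Spark equivalent (not run here; assumes the default "prediction" column name):
#   evaluator = RegressionEvaluator(metricName="mse", labelCol="rating",
#                                   predictionCol="prediction")
#   mse = evaluator.evaluate(predictions)


def mean_squared_error(pairs):
    """Average of (rating - prediction)^2 over (rating, prediction) pairs."""
    pairs = list(pairs)
    if not pairs:
        raise ValueError("no predictions to evaluate")
    return sum((r - p) ** 2 for r, p in pairs) / len(pairs)


# Invented (rating, prediction) pairs
sample = [(4.0, 3.5), (3.0, 3.0), (5.0, 4.0), (2.0, 3.0)]
mse = mean_squared_error(sample)
rmse = mse ** 0.5  # root mean squared error, in the same units as the ratings

print(mse, rmse)
```

A lower value means the predicted ratings sit closer to the ratings users actually gave.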

That’s all for applying matrix factorization to movie recommendations. There is still plenty of room to explore from here, from improving accuracy to finding the best balance between performance and accuracy. Have a nice day!
