PySpark Song Recommender System

Photo by rawpixel on Unsplash

Introduction

In this final project, we apply big data tools to build and evaluate a recommender system based on collaborative filtering. We work with the Million Song Dataset (MSD), which provides implicit user feedback. Using Spark’s alternating least squares (ALS) method, we learn latent factor representations for users and items and generate recommendations for the users in the test set. We then compare our model against a single-machine implementation and against the baseline for the extension.
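As a rough illustration of the ALS step, here is a minimal sketch of how an implicit-feedback model might be set up with `pyspark.ml`. It is not the project's actual code: the hyperparameter values, the `user_idx` and `track_idx` column names, and the helper names `als_params` and `fit_als` are all assumptions made for this example. Spark's ALS expects numeric user and item ids, so the string ids are indexed first.

```python
def als_params(rank=10, reg_param=0.1, alpha=1.0, max_iter=10):
    """Keyword arguments for pyspark.ml.recommendation.ALS in implicit mode.

    All default values here are illustrative assumptions, not tuned settings.
    """
    return dict(
        userCol="user_idx",        # hypothetical name for the indexed user_id
        itemCol="track_idx",       # hypothetical name for the indexed track_id
        ratingCol="count",         # the count column is the implicit signal
        implicitPrefs=True,        # treat counts as confidence, not ratings
        rank=rank,                 # number of latent factors
        regParam=reg_param,        # regularization strength
        alpha=alpha,               # confidence scaling for implicit feedback
        maxIter=max_iter,
        coldStartStrategy="drop",  # drop NaN predictions for unseen ids
    )


def fit_als(train_df, **overrides):
    """Index the string ids and fit ALS on a Spark DataFrame.

    Expects columns user_id (string), count (int), track_id (string).
    PySpark is imported here so the helper above can be used without Spark.
    """
    from pyspark.ml import Pipeline
    from pyspark.ml.feature import StringIndexer
    from pyspark.ml.recommendation import ALS

    user_indexer = StringIndexer(
        inputCol="user_id", outputCol="user_idx", handleInvalid="skip"
    )
    track_indexer = StringIndexer(
        inputCol="track_id", outputCol="track_idx", handleInvalid="skip"
    )
    als = ALS(**als_params(**overrides))
    # The fitted PipelineModel applies the same id mapping at prediction time.
    return Pipeline(stages=[user_indexer, track_indexer, als]).fit(train_df)
```

Under these assumptions, `fit_als(train_df, rank=20)` would return a fitted pipeline whose last stage is the ALS model; calling `recommendForAllUsers(k)` on that stage yields the top-k tracks per indexed user.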

Data Overview

The basic recommender system uses train, validation, and test Parquet files. Each row consists of user_id (string), count (int), and track_id (string). The training set has 49,824,519 records, the validation set 135,938, and the test set 1,368,430. Additional data, including metadata, features, genre tags, and lyrics, are also used for the extension.
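To put the split sizes in perspective, the short sketch below computes each split's share of the total from the record counts quoted above. Only the variable names are introduced here.

```python
# Record counts quoted above for each split.
split_counts = {
    "train": 49_824_519,
    "validation": 135_938,
    "test": 1_368_430,
}

total = sum(split_counts.values())

# Fraction of all records that falls in each split.
split_share = {name: n / total for name, n in split_counts.items()}

for name, n in split_counts.items():
    print(f"{name:<10} {n:>12,} {split_share[name]:7.2%}")
```

The training split holds roughly 97% of all records, so the validation and test sets are small by comparison.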

Report

Please see the report PDF for details.

Di He
Senior Data Scientist

I am passionate about applying data science in the real world.
