
Introduction
In this final project, we apply big data tools to build and evaluate a recommender system based on collaborative filtering. The dataset we work on is the Million Song Dataset (MSD), which provides implicit user feedback in the form of play counts. Using Spark’s alternating least squares (ALS) method, we learn latent factor representations for users and items, and generate recommendations for users in the test set. We then compare our model against a single-machine implementation and against the baseline for the extension.
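To make the ALS idea concrete, the following is a minimal single-process sketch of implicit-feedback ALS in pure Python. It is not the project's code and not Spark's implementation; it only illustrates how user and item factors are re-solved alternately. It assumes the common implicit-feedback formulation in which a play count c gives preference 1 and confidence 1 + alpha * c. The toy rows, the IDs, the function names, and the values of `rank`, `alpha`, and `reg` are all hypothetical.

```python
import random

def solve(A, b):
    """Solve the small dense system A x = b by Gauss-Jordan elimination."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        # Partial pivoting: move the largest entry in this column to the diagonal.
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(n):
            if r != col:
                f = M[r][col] / M[col][col]
                for c in range(col, n + 1):
                    M[r][c] -= f * M[col][c]
    return [M[i][n] / M[i][i] for i in range(n)]

def update(fixed, counts_for, rank, alpha, reg):
    """Re-solve one side's factors while the other side (`fixed`) is held constant."""
    new = []
    for row_counts in counts_for:
        # Normal equations: (sum_j conf_j * y_j y_j^T + reg * I) x = sum_j conf_j * pref_j * y_j
        A = [[reg if i == j else 0.0 for j in range(rank)] for i in range(rank)]
        b = [0.0] * rank
        for idx, y in enumerate(fixed):
            count = row_counts.get(idx, 0)
            conf = 1.0 + alpha * count          # assumed confidence weighting
            pref = 1.0 if count > 0 else 0.0    # binary preference from the count
            for i in range(rank):
                b[i] += conf * pref * y[i]
                for j in range(rank):
                    A[i][j] += conf * y[i] * y[j]
        new.append(solve(A, b))
    return new

def loss(X, Y, user_counts, alpha, reg):
    """Confidence-weighted squared error over all user-item pairs plus L2 penalty."""
    total = 0.0
    for u, x in enumerate(X):
        for i, y in enumerate(Y):
            count = user_counts[u].get(i, 0)
            pref = 1.0 if count > 0 else 0.0
            pred = sum(a * b for a, b in zip(x, y))
            total += (1.0 + alpha * count) * (pref - pred) ** 2
    return total + reg * sum(v * v for vec in X + Y for v in vec)

def fit_implicit_als(rows, rank=2, alpha=40.0, reg=0.1, iters=10, seed=0):
    """rows: iterable of (user_id, count, track_id) tuples with string IDs."""
    users = sorted({u for u, _, _ in rows})
    items = sorted({t for _, _, t in rows})
    u_idx = {u: k for k, u in enumerate(users)}   # string ID -> integer index
    i_idx = {t: k for k, t in enumerate(items)}
    user_counts = [dict() for _ in users]
    item_counts = [dict() for _ in items]
    for u, c, t in rows:
        user_counts[u_idx[u]][i_idx[t]] = c
        item_counts[i_idx[t]][u_idx[u]] = c
    rng = random.Random(seed)
    X = [[rng.uniform(-0.1, 0.1) for _ in range(rank)] for _ in users]
    Y = [[rng.uniform(-0.1, 0.1) for _ in range(rank)] for _ in items]
    history = [loss(X, Y, user_counts, alpha, reg)]
    for _ in range(iters):
        X = update(Y, user_counts, rank, alpha, reg)  # fix items, solve users
        Y = update(X, item_counts, rank, alpha, reg)  # fix users, solve items
        history.append(loss(X, Y, user_counts, alpha, reg))
    return users, items, X, Y, history

# Hypothetical toy rows in (user_id, count, track_id) order.
toy_rows = [
    ("user_a", 3, "track_1"), ("user_a", 1, "track_2"),
    ("user_b", 2, "track_1"), ("user_b", 5, "track_2"),
    ("user_c", 4, "track_3"), ("user_c", 1, "track_4"),
]
users, items, X, Y, history = fit_implicit_als(toy_rows)
print("loss before/after:", history[0], history[-1])
```

Each half-step solves a regularized least-squares problem exactly, so the objective should not increase from one iteration to the next. At the scale of the MSD, the per-user and per-item solves are what a distributed implementation such as Spark's parallelizes; this sketch loops over every user-item pair and is only practical for toy inputs.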
Data Overview
The data used for the basic recommendation system consists of train, validation, and test parquet files. Each row consists of user_id (string), count (int), and track_id (string). There are 49,824,519 records in the training set, 135,938 in the validation set, and 1,368,430 in the test set. Additional data, including metadata, features, genre tags, and lyrics, are also used for the extension.
Report
Please see the report PDF for details.