PySpark Song Recommender System

Apr 30, 2021

Introduction

In this final project, we apply big data tools to build and evaluate a recommender system based on collaborative filtering. We work with the Million Song Dataset (MSD), which provides implicit user feedback. Using Spark’s alternating least squares (ALS) method, we learn latent factor representations for users and items, then generate recommendations for the users in the test set. Finally, we compare our model with a single-machine implementation and with the baseline for the extension.
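For readers unfamiliar with ALS on implicit feedback, the objective can be sketched as follows. This is the standard formulation from the implicit-feedback collaborative filtering literature, which Spark's documentation cites for its implicit mode; it is not taken from the project report. The symbols are assumptions introduced here: r_ui is the observed count for user u and item i, x_u and y_i are the latent factor vectors, alpha is a confidence scaling parameter, and lambda is the regularization weight.

```latex
\min_{X, Y} \; \sum_{u, i} c_{ui} \left( p_{ui} - x_u^\top y_i \right)^2
  + \lambda \left( \sum_u \lVert x_u \rVert^2 + \sum_i \lVert y_i \rVert^2 \right),
\qquad
p_{ui} =
\begin{cases}
  1 & \text{if } r_{ui} > 0 \\
  0 & \text{otherwise}
\end{cases},
\qquad
c_{ui} = 1 + \alpha \, r_{ui}
```

With the item factors held fixed, each x_u has a closed-form regularized least-squares solution, and the same holds for each y_i when the user factors are fixed. Alternating between the two updates is what gives the method its name.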

Data Overview

The data for the basic recommender system consists of train, validation, and test Parquet files. Each row contains a user_id (string), a count (int), and a track_id (string). The training set has 49,824,519 records, the validation set 135,938, and the test set 1,368,430. Additional data, including metadata, features, genre tags, and lyrics, is also used for the extension.

Report

Please see the report PDF for details.

Di He

Senior Data Scientist
