Self/Semi-Supervised Image Classification

May 10, 2021

Photo by rawpixel on Unsplash

Introduction

At the end of Deep Learning course taught by Yann LeCun, a final project was assigned. We competed with your classmates on who can find the best self/semi-supervised learning algorithm. We were given a dataset with a large amount of unlabeled data and a small amount of labeled data to train the model, and the final performance of the model will be evaluated on a hidden test set and posted on a public leaderboard.

Data Overview

The dataset, of color images of size 96×96, that has the following structure

512, 000 unlabeled images.
25, 600 labeled training images (32 examples, 800 classes).
25, 600 labeled validation images (32 examples, 800 classes).

Overall, a hidden test set was kept, which will be used to test the models. In order to improve performance when training our model with few labeled samples, we need to make use of the unlabeled data.

Our presentation

Semi-supervised learning incorporates both labeled and unlabeled data points in the training process and aims to use leverage the unlabeled data to learn features that would support modeling process when using the training data for specific tasks. The community has gained popularity as the amount of data required and cost of obtaining human labeled data increased over the years. This video explains the modeling and thought process our team had when tackling the given task, and achieved 50% accuracy on the test set with CoMatch.