SimCLR — A Simple Framework For Contrastive Learning of Visual Representations

Rittika Sur
4 min read · Aug 16, 2020

Over the years, various new approaches to self-supervised learning have been introduced, each better than the last, but none of them matched the performance of supervised approaches.

Recent NLP models show that pretraining on a large unlabelled dataset and then fine-tuning with a small amount of labeled data substantially improves accuracy. A similar approach has great potential to improve performance in computer vision.

In their paper “A Simple Framework for Contrastive Learning of Visual Representations”, Google Brain researchers not only improve upon previous state-of-the-art self-supervised learning methods but also match a supervised baseline on ImageNet classification. SimCLR is not a new network architecture; it is a set of fixed steps that can be used to obtain high-quality image embeddings, with improved accuracy on a downstream classification task.

How does it work?

The model works in two phases: self-supervised pretraining to learn an image representation, followed by fine-tuning with labeled data.

Overview — Data points from the original unlabelled dataset are augmented twice, generating two views of each image. A base encoder (ResNet-50) with non-linear layers on top is then pre-trained on these augmented views using a contrastive loss. This pretraining yields an image representation that is then used to train (fine-tune) a linear fully connected layer with a very small amount of labeled data for a downstream task such as image classification.
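To make that pipeline concrete, here is a minimal sketch of one pre-training step in Python. The names `augment`, `encoder`, `projection_head`, and `nt_xent_loss` are placeholders for the components described in the sections that follow, not code from the paper.

```python
def simclr_pretraining_step(images, augment, encoder, projection_head, nt_xent_loss):
    # Two independent random augmentations of the same minibatch -> 2N views
    view_a, view_b = augment(images), augment(images)

    # Base encoder (ResNet-50) maps each view to a 2048-d representation h
    h_a, h_b = encoder(view_a), encoder(view_b)

    # Projection head (2-layer MLP) maps h to a 128-d embedding z
    z_a, z_b = projection_head(h_a), projection_head(h_b)

    # NT-Xent contrastive loss: agreement between matching views is maximized,
    # agreement with every other view in the minibatch is minimized
    return nt_xent_loss(z_a, z_b)
```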

Complete SimCLR architecture

Self-supervised learning — For pretraining, SimCLR uses ResNet-50 as the base encoder network. It learns generic image representations from an unlabeled dataset, in which each image is augmented twice, using a contrastive loss. This results in a 2048-dimensional representation of each image.
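A minimal sketch of the base encoder, assuming PyTorch and a recent torchvision (the paper's official implementation is in TensorFlow): the classification layer is replaced with an identity so the forward pass returns the 2048-dimensional representation h.

```python
import torch
import torchvision

# Base encoder: ResNet-50 with its classification head removed,
# so the forward pass returns the 2048-d representation h.
encoder = torchvision.models.resnet50(weights=None)
encoder.fc = torch.nn.Identity()

dummy_views = torch.randn(8, 3, 224, 224)  # a toy batch of augmented views
h = encoder(dummy_views)                   # shape: (8, 2048)
```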

Data augmentation — First, the network samples a minibatch of N data points drawn randomly from the unlabelled dataset. Each image is augmented twice using simple augmentation techniques (random cropping followed by resizing back to the original size, random color distortion, and random Gaussian blur), creating two sets of corresponding views. The contrastive prediction task is defined on pairs of augmented examples derived from the minibatch, giving 2N data points in total.
The two augmented views of the same image form the positive pair. Given a positive pair, the other 2(N-1) augmented examples within the minibatch are treated as negative examples.
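A sketch of that augmentation policy using torchvision transforms. The probabilities and the color-distortion strength s = 1.0 follow the paper's defaults as I understand them, so treat the exact numbers as assumptions.

```python
from torchvision import transforms

s = 1.0  # color-distortion strength
color_jitter = transforms.ColorJitter(0.8 * s, 0.8 * s, 0.8 * s, 0.2 * s)

simclr_augment = transforms.Compose([
    transforms.RandomResizedCrop(224),              # random crop, resized back to 224x224
    transforms.RandomHorizontalFlip(),
    transforms.RandomApply([color_jitter], p=0.8),  # random color distortion
    transforms.RandomGrayscale(p=0.2),
    transforms.RandomApply(
        [transforms.GaussianBlur(kernel_size=23, sigma=(0.1, 2.0))], p=0.5
    ),                                              # random Gaussian blur
    transforms.ToTensor(),
])

def two_views(image):
    # Each image is augmented twice, producing the positive pair of views.
    return simclr_augment(image), simclr_augment(image)
```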

Contrastive loss — helps the model learn a representation by simultaneously maximizing the distance (minimizing agreement) between representations of different images and minimizing the distance (maximizing agreement) between representations of the same image. This method is called contrastive learning. In this paper, cosine similarity is used to measure the distance between the representations of similar and dissimilar views. The paper calls this loss function NT-Xent (the normalized temperature-scaled cross-entropy loss).

Contrastive loss focuses on maximizing the agreement between the representations of the same image.
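A minimal PyTorch sketch of an NT-Xent-style loss, assuming `z1` and `z2` are the two batches of 128-d projections for the same N images. The temperature value here is just an assumed hyperparameter.

```python
import torch
import torch.nn.functional as F

def nt_xent_loss(z1, z2, temperature=0.5):
    """NT-Xent sketch: cosine similarity between all 2N projections; for each
    view, the other augmentation of the same image is the positive and the
    remaining 2(N - 1) views in the minibatch are the negatives."""
    n = z1.size(0)
    z = F.normalize(torch.cat([z1, z2], dim=0), dim=1)  # 2N x 128, unit norm
    sim = z @ z.t() / temperature                       # pairwise cosine similarities

    # An example must never be its own negative, so mask the diagonal.
    mask = torch.eye(2 * n, dtype=torch.bool, device=z.device)
    sim = sim.masked_fill(mask, float("-inf"))

    # For row i in the first half, the positive sits at i + n, and vice versa.
    targets = torch.cat([torch.arange(n, 2 * n), torch.arange(0, n)]).to(z.device)
    return F.cross_entropy(sim, targets)
```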

Projection Head — The representation from the base encoder is passed through a non-linear projection implemented as a fully connected network (i.e., a multi-layer perceptron), which amplifies the invariant features and maximizes the network's ability to identify different transformations of the same image. The projection is a 2-layer dense network that maps the representation to a 128-dimensional latent space. These non-linear layers are called the projection head.
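A sketch of the projection head as a plain PyTorch module: a 2-layer MLP mapping the 2048-d representation h to the 128-d latent vector z that the contrastive loss operates on. The 2048-d hidden width is my assumption, matching the encoder output.

```python
import torch.nn as nn

# Projection head: 2048-d representation h -> 128-d embedding z.
projection_head = nn.Sequential(
    nn.Linear(2048, 2048),  # hidden layer (width assumed equal to the input)
    nn.ReLU(),
    nn.Linear(2048, 128),   # projection into the 128-d latent space
)
```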

Supervised Fine-Tuning — After pretraining, either the representation from the encoder can be used directly, or it can be fine-tuned with some labeled data through a fully connected linear layer. This is where transfer learning comes into play. For fine-tuning, the representation from the encoder is used instead of the output of the projection head, and the network is trained on a small amount of labeled data. The fine-tuned network is then used for downstream tasks such as image classification.

For fine-tuning, the image representation is taken from the encoder instead of the projection head
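A sketch of the fine-tuning stage, reusing the `encoder` from above: the projection head is discarded and a linear layer is trained on top of the 2048-d encoder representation with a small labelled set. `num_classes` is whatever the downstream task needs; 1000 is just an ImageNet-style example.

```python
import torch.nn as nn

num_classes = 1000  # set to the downstream task's label count
classifier = nn.Linear(2048, num_classes)

def downstream_logits(images):
    # The representation is taken from the pretrained encoder,
    # not from the projection head, and fed to the linear classifier.
    h = encoder(images)
    return classifier(h)
```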

Default settings — ResNet-50 as the base encoder network, and a 2-layer MLP projection head that projects the representation to a 128-dimensional latent space. NT-Xent is used as the loss, optimized with LARS at a learning rate of 4.8 (= 0.3 × BatchSize / 256) and a weight decay of 10^-6. The model is trained with a batch size of 4096 for 100 epochs.
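The linear learning-rate scaling from those default settings, written out as plain arithmetic (a sketch of the hyperparameters, not a full training script):

```python
batch_size = 4096
base_lr = 0.3
learning_rate = base_lr * batch_size / 256  # = 4.8 at the default batch size
weight_decay = 1e-6
epochs = 100
```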

Performance

A linear classifier trained on the self-supervised representations learned by SimCLR achieves 76.5% top-1 accuracy, which is a 7% relative improvement over the previous SOTA, matching the performance of a supervised ResNet-50. When fine-tuned on only 1% of the labels, 85.8% top-5 accuracy is achieved, outperforming AlexNet with 100 times fewer labels.

Key Findings

  1. Data augmentation plays a key role in defining effective predictive tasks.
  2. Introducing a non-linearity between the representation and the contrastive loss improves the performance of linear classifiers trained on the SimCLR-learned representation by more than 10%.
  3. Contrastive learning benefits from larger batch sizes and more training steps compared to supervised learning.

Reference

Have a look at my website https://tranzposingradient.wordpress.com/

The blog — https://tranzposingradient.wordpress.com/2020/11/13/simclr-a-simple-framework-for-contrastive-learning-of-visual-representation/
