
Distilling the Knowledge in a Neural Network


It’s a way of training a small network using the knowledge in a trained larger network; i.e. distilling the knowledge from the large network.

A large model with strong regularization, or an ensemble of models (for example, trained with dropout), generalizes better than a small model trained directly on the data and labels. However, a small model can be trained to generalize better with the help of a large model. Smaller models are also preferable in production: they are faster and need less compute and memory.

The output probabilities of a trained model carry more information than the hard labels, because the model assigns non-zero probabilities to the incorrect classes as well. These probabilities tell us how plausible the other classes are for a given sample. For instance, when classifying digits, a well-generalized model given an image of the digit 7 will assign a high probability to 7, a small but non-zero probability to 2, and almost zero probability to the other digits. Distillation uses this extra information to train the small model better.
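To make this concrete, here is a small Python sketch of what such soft targets look like. It applies a softmax, and a softmax with the raised temperature used in the paper to soften the distribution, to some hypothetical logits for an image of a 7; the logit values and the temperature of 4 are illustrative, not taken from the paper.

```python
import numpy as np

def softmax(logits, T=1.0):
    """Softmax with temperature T; higher T gives a softer distribution."""
    z = np.asarray(logits, dtype=float) / T
    z -= z.max()  # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

# Hypothetical logits from a trained digit classifier for an image of a "7"
# (classes 0-9); class 7 is largest, class 2 is the runner-up.
logits = [1.0, 0.5, 4.0, 0.2, 0.1, 0.3, 0.0, 9.0, 0.4, 0.6]

print(softmax(logits))          # hard-ish targets: nearly all mass on class 7
print(softmax(logits, T=4.0))   # soft targets: class 2 is now visibly non-zero
```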

We train the small model to minimize the cross-entropy or KL divergence between its output probability distribution and the large network's output probability distribution (the soft targets); the two objectives differ only by the entropy of the soft targets, which is constant with respect to the small model.
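Below is a minimal PyTorch sketch of this distillation loss. Following the paper, both sets of logits are softened with a temperature T before comparison, and the result is scaled by T² to keep gradient magnitudes comparable across temperatures; the value T=2.0, the tensor shapes, and the function name are illustrative choices, not taken from the paper.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, T=2.0):
    """KL divergence between temperature-softened teacher and student outputs."""
    log_p_student = F.log_softmax(student_logits / T, dim=-1)
    p_teacher = F.softmax(teacher_logits / T, dim=-1)
    # "batchmean" averages the per-sample KL over the batch; the T**2 factor
    # keeps gradient magnitudes comparable across temperatures.
    return F.kl_div(log_p_student, p_teacher, reduction="batchmean") * T**2

# Illustrative usage: random logits for a batch of 8 samples and 10 classes.
student_logits = torch.randn(8, 10, requires_grad=True)
teacher_logits = torch.randn(8, 10)
loss = distillation_loss(student_logits, teacher_logits)
loss.backward()
```

In practice this term is typically combined with an ordinary cross-entropy loss on the true labels, with a weighting between the two objectives.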
