
Knowledge Distillation

  • Title: Distilling the Knowledge in a Neural Network
  • Publication: NIPS 2014 Deep Learning Workshop
  • Link: [paper] [code]

Abstract

  • A simple way to improve the performance of a machine learning model is to ensemble many models, but ensembles have the disadvantage of high latency and computational cost at inference time.
  • The paper proposes a way to distill the knowledge of a trained large-scale model (or an ensemble of models) into a single small model.

Distillation

Mechanism of Knowledge distillation

  • Do not focus only on the class with the highest probability; the other probabilities also carry information.
    • Among the categories BMW, truck, and carrot, suppose the actual label is BMW.
    • A BMW is unlikely to be classified as a truck, but that probability is still much higher than for a carrot!
  • Use the class probabilities produced by the cumbersome model as soft targets for the small model to learn from.
  • Because soft targets have high entropy, they carry more information per training case than the hard targets used in ordinary training.
  • They also have less variance in the gradient between training cases, so the small model can be trained efficiently on much less data.

soft label

  • Make the small model perform well by learning from the outputs of the cumbersome model.
  • Introduce a new parameter called the temperature T; T = 1 recovers the ordinary softmax.
    • A high T is used when transferring knowledge from the cumbersome model, and T = 1 is used when the small model is deployed (the larger T is, the softer the distribution becomes).

q_i = exp(z_i / T) / Σ_j exp(z_j / T)
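A minimal NumPy sketch of this temperature-scaled softmax (function and variable names are illustrative, not from the paper):

```python
import numpy as np

def softmax_with_temperature(logits, T=1.0):
    """Softmax over logits divided by temperature T; T = 1 is ordinary softmax."""
    z = np.asarray(logits, dtype=float) / T
    z -= z.max()                      # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

logits = [9.0, 4.0, -5.0]             # e.g. BMW, truck, carrot
p1 = softmax_with_temperature(logits, T=1.0)   # close to one-hot
p5 = softmax_with_temperature(logits, T=5.0)   # softer distribution
```

Raising T spreads probability mass onto the non-maximal classes, which is exactly what exposes the "truck is likelier than carrot" information to the small model.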

hard label

  • Cross entropy with the correct labels, computed on the small model's logits at T = 1 (the objective generally used in ordinary training)

distillation loss

  • The loss function is a weighted sum of the two losses above. Generally, the former (soft-label) loss is given a large weight and the latter (hard-label) loss a small one.
  • Giving less weight to the hard labels improves the robustness of the model.
    • It becomes less sensitive to incorrect labels and more robust to noise.

[figure: distillation loss, a weighted sum of the soft-label and hard-label cross-entropies]
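The combined objective can be sketched as below. The α weighting is a common convention (not the paper's notation), while the T² scaling of the soft term follows the paper's note that soft-target gradients scale as 1/T²; all names are illustrative:

```python
import numpy as np

def softmax(logits, T=1.0):
    z = np.asarray(logits, dtype=float) / T
    z -= z.max()                      # numerical stability
    e = np.exp(z)
    return e / e.sum()

def distillation_loss(student_logits, teacher_logits, label, T=4.0, alpha=0.9):
    """Weighted sum of the soft-target and hard-target cross-entropies.

    alpha weights the soft loss (usually large); the soft term is multiplied
    by T**2 so its gradient magnitude stays comparable to the hard term.
    """
    q_soft = softmax(student_logits, T)      # student distribution at temperature T
    p_soft = softmax(teacher_logits, T)      # teacher soft targets at the same T
    soft_ce = -np.sum(p_soft * np.log(q_soft))
    q_hard = softmax(student_logits, 1.0)    # student distribution at T = 1
    hard_ce = -np.log(q_hard[label])         # cross entropy with the one-hot label
    return alpha * (T ** 2) * soft_ce + (1 - alpha) * hard_ce
```

A student whose logits match the teacher's and whose argmax is the correct label incurs a smaller loss than one that disagrees with both.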

Matching logits is a special case of distillation

  • v: logits of the cumbersome model (which produce the soft targets p), z: logits of the small model (which produce q)
  • Differentiating the cross entropy C with respect to a logit z_i gives the following.

∂C/∂z_i = (1/T) (q_i − p_i) = (1/T) ( e^{z_i/T} / Σ_j e^{z_j/T} − e^{v_i/T} / Σ_j e^{v_j/T} )

  • If T is large compared to the magnitude of the logits, we can approximate e^x ≈ 1 + x.

∂C/∂z_i ≈ (1/T) ( (1 + z_i/T) / (N + Σ_j z_j/T) − (1 + v_i/T) / (N + Σ_j v_j/T) )

  • If distillation is working well, the logits can be assumed to have zero mean over each transfer case.
  • The Σ_j z_j and Σ_j v_j terms can then be ignored.

∂C/∂z_i ≈ (1/(N T²)) (z_i − v_i)

  • If the temperature is small, the soft-target distribution collapses toward one-hot encoding.
    • Distillation then pays much less attention to matching logits that are much more negative than the average.
  • If the temperature is large, the distribution becomes soft, and the gradient above shows that distillation is equivalent to minimizing ½ (z_i − v_i)², i.e., matching the logits of the cumbersome model, including the very negative ones.
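The high-temperature approximation above can be checked numerically: for zero-mean logits and a T much larger than their magnitude, the exact gradient (q_i − p_i)/T should be close to (z_i − v_i)/(N T²). All names below are illustrative:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

rng = np.random.default_rng(0)
N = 5
z = rng.normal(size=N); z -= z.mean()   # zero-mean small-model logits
v = rng.normal(size=N); v -= v.mean()   # zero-mean cumbersome-model logits

T = 100.0                               # large relative to the logits
exact = (softmax(z / T) - softmax(v / T)) / T   # dC/dz_i = (q_i - p_i)/T
approx = (z - v) / (N * T ** 2)                 # high-T approximation
# For large T the two agree closely, confirming the derivation.
```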

Experiment

  • On MNIST, with 300 or more units per hidden layer, all temperatures T ≥ 8 gave similar results.
    • When the number of units is drastically reduced to about 30 per layer, performance is best at 2.5 ≤ T ≤ 4.
  • In speech recognition, the distilled model is observed to reach high accuracy, and its word error rate is also low.

[figure: experimental results on MNIST and speech recognition]

Reference

@article{DBLP:journals/corr/HintonVD15,
  author       = {Geoffrey E. Hinton and
                  Oriol Vinyals and
                  Jeffrey Dean},
  title        = {Distilling the Knowledge in a Neural Network},
  journal      = {CoRR},
  volume       = {abs/1503.02531},
  year         = {2015},
  url          = {http://arxiv.org/abs/1503.02531},
  eprinttype   = {arXiv},
  eprint       = {1503.02531},
  timestamp    = {Mon, 13 Aug 2018 16:48:36 +0200},
  biburl       = {https://dblp.org/rec/journals/corr/HintonVD15.bib},
  bibsource    = {dblp computer science bibliography, https://dblp.org}
}