- One simple way to improve the performance of a machine learning model is ensembling, but ensembles suffer from high inference latency and computational cost.
- The paper proposes a way to distill the knowledge of a trained large-scale model (or ensemble of models) into a small model.
- Do not focus only on the highest-probability class; the other class probabilities also carry information.
- Among the classes BMW, truck, and carrot, suppose the true label is BMW.
- The image is unlikely to be classified as a truck, but that probability is still far higher than the probability of carrot!
- Use the class probabilities produced by the cumbersome model as soft targets for the small model to learn from.
- Because soft targets have high entropy, they carry much more information per training case than the hard targets used in ordinary training.
- They also produce much less variance in the gradient between training cases, so the small model can be trained efficiently even on much less data.
- The goal is to make the small model perform well using the outputs of the cumbersome model.
- The paper introduces a new parameter called 'temperature' ($T$) into the softmax, $q_i = \exp(z_i/T) / \sum_j \exp(z_j/T)$; T=1 recovers the normal softmax.
- A high T is used when transferring knowledge from the cumbersome model, and T=1 is used when the small model is deployed (the larger T is, the softer the distribution becomes).
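The temperature softmax above can be sketched in a few lines of NumPy; the function name and example logits are my own, chosen to echo the BMW/truck/carrot example:

```python
import numpy as np

def softmax_with_temperature(logits, T=1.0):
    """Softened class probabilities: q_i = exp(z_i / T) / sum_j exp(z_j / T)."""
    z = np.asarray(logits, dtype=float) / T
    z = z - z.max()  # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

logits = [5.0, 2.0, -3.0]  # e.g. BMW, truck, carrot
print(softmax_with_temperature(logits, T=1.0))  # sharp, near one-hot
print(softmax_with_temperature(logits, T=5.0))  # softer distribution
```

Raising T shrinks the gap between the largest and smallest probabilities while preserving their order, which is exactly what makes the truck-vs-carrot distinction visible to the small model.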
- The second loss is the cross entropy with the correct (hard) labels, as generally used in practice.
- The loss function is a weighted sum of these two losses; the soft-target term is usually given a large weight and the hard-label term a small one.
- Giving less weight to the hard labels improves the robustness of the model:
- it makes the model less sensitive to incorrect labels and more robust to noise.
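The weighted loss above can be sketched as follows for a single example. The function names and the particular weights (alpha = 0.9, T = 4) are illustrative choices, not values fixed by the paper; the $T^2$ factor, however, is from the paper, which rescales the soft-target term so its gradient magnitude stays comparable to the hard-label term:

```python
import numpy as np

def softmax(z, T=1.0):
    z = np.asarray(z, dtype=float) / T
    z = z - z.max()  # numerical stability
    e = np.exp(z)
    return e / e.sum()

def distillation_loss(student_logits, teacher_logits, hard_label, T=4.0, alpha=0.9):
    """alpha * soft-target CE (at temperature T) + (1 - alpha) * hard-label CE."""
    p = softmax(teacher_logits, T)     # teacher's soft targets
    q_T = softmax(student_logits, T)   # student at the same temperature
    q = softmax(student_logits, 1.0)   # student at T = 1 for the hard label
    soft_ce = -np.sum(p * np.log(q_T))
    hard_ce = -np.log(q[hard_label])
    # T**2 compensates for the 1/T**2 scaling of the soft-target gradients
    return alpha * (T ** 2) * soft_ce + (1.0 - alpha) * hard_ce
```

A student whose logits match the teacher's incurs a much lower loss than one whose logits disagree, and the small hard-label weight keeps the loss from being dominated by any single noisy label.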
- Notation: $v_i$ are the logits of the cumbersome model, which produce the soft targets $p_i$; $z_i$ are the logits of the distilled model, which produce $q_i$.
- Differentiating the cross entropy $C$ with respect to a logit $z_i$ gives $\frac{\partial C}{\partial z_i} = \frac{1}{T}(q_i - p_i) = \frac{1}{T}\left(\frac{e^{z_i/T}}{\sum_j e^{z_j/T}} - \frac{e^{v_i/T}}{\sum_j e^{v_j/T}}\right)$.
- If $T$ is large compared with the magnitude of the logits, we can approximate $e^x \approx 1 + x$, giving $\frac{\partial C}{\partial z_i} \approx \frac{1}{T}\left(\frac{1 + z_i/T}{N + \sum_j z_j/T} - \frac{1 + v_i/T}{N + \sum_j v_j/T}\right)$.
- If distillation is done well, we can assume the logits are zero-mean, so $\sum_j z_j = \sum_j v_j = 0$ and the sums in the denominators can be ignored, leaving $\frac{\partial C}{\partial z_i} \approx \frac{1}{N T^2}(z_i - v_i)$.
- In the high-temperature limit, distillation is therefore equivalent to minimizing the squared difference between the two models' logits (logit matching).
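This approximation is easy to check numerically: with zero-mean logits and a temperature much larger than the logit magnitudes, the exact gradient $(q_i - p_i)/T$ should be close to the logit-matching form $(z_i - v_i)/(N T^2)$. A minimal sketch (the logit values are made up for illustration):

```python
import numpy as np

def softmax(z, T):
    z = np.asarray(z, dtype=float) / T
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

# Zero-mean logits, as assumed in the derivation
z = np.array([1.0, -0.5, -0.5])   # distilled (student) model
v = np.array([0.8, 0.4, -1.2])    # cumbersome (teacher) model
N = len(z)

T = 100.0  # temperature much larger than the logit magnitudes
exact_grad = (softmax(z, T) - softmax(v, T)) / T
approx_grad = (z - v) / (N * T ** 2)
print(exact_grad)
print(approx_grad)  # the two should agree closely at high T
```

At moderate temperatures the two gradients diverge again, which is consistent with the observation below that the best temperature depends on model capacity.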
- If the temperature is small, the soft-target distribution approaches one-hot encoding,
- so distillation pays much less attention to matching the very negative logits.
- If the temperature is large, the distribution becomes soft and the very negative logits of the soft targets receive larger probability,
- so their difference from the distilled model's logits contributes more to the loss.
- On MNIST, with more than 300 units per hidden layer, all temperatures T ≥ 8 gave similar results.
- When the number of units is drastically reduced to about 30 per layer, the best performance is obtained with 2.5 ≤ T ≤ 4.
- In speech recognition, the distilled model's accuracy approaches that of the ensemble, and its word error rate is also low.
@article{DBLP:journals/corr/HintonVD15,
author = {Geoffrey E. Hinton and
Oriol Vinyals and
Jeffrey Dean},
title = {Distilling the Knowledge in a Neural Network},
journal = {CoRR},
volume = {abs/1503.02531},
year = {2015},
url = {http://arxiv.org/abs/1503.02531},
eprinttype = {arXiv},
eprint = {1503.02531},
timestamp = {Mon, 13 Aug 2018 16:48:36 +0200},
biburl = {https://dblp.org/rec/journals/corr/HintonVD15.bib},
bibsource = {dblp computer science bibliography, https://dblp.org}
}




