Journal
IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS
Volume 33, Issue 1, Pages 257-269
Publisher
IEEE-INST ELECTRICAL ELECTRONICS ENGINEERS INC
DOI: 10.1109/TNNLS.2020.3027634
Keywords
Training; Nonlinear distortion; Data models; Neural networks; Knowledge engineering; Network architecture; Generalization error; network compression; representation invariance; self-distillation (SD)
Funding
- Major Project for New Generation of Artificial Intelligence (AI) [2018AAA0100400]
- National Natural Science Foundation of China (NSFC) [61836014, 61721004]
- Ministry of Science and Technology of China
This article proposes an elegant self-distillation mechanism that directly obtains high-accuracy models without the need for an assistive model. It learns data representation invariance and effectively reduces generalization error for various network architectures, surpassing existing model distillation methods with little extra training effort.
To harvest small networks with high accuracies, most existing methods mainly utilize compression techniques, such as low-rank decomposition and pruning, to compress a trained large model into a small network, or transfer knowledge from a powerful large model (teacher) to a small network (student). Despite their success in generating small models of high performance, the dependence on accompanying assistive models complicates the training process and increases memory and time costs. In this article, we propose an elegant self-distillation (SD) mechanism to obtain high-accuracy models directly, without going through an assistive model. Inspired by invariant recognition in the human visual system, we observe that different distorted instances of the same input should possess similar high-level data representations. Thus, we can learn data representation invariance between different distorted versions of the same sample. Specifically, in our SD-based learning algorithm, a single network uses the maximum mean discrepancy (MMD) metric to learn global feature consistency and the Kullback-Leibler (KL) divergence to constrain posterior class probability consistency across the different distorted branches. Extensive experiments on the MNIST, CIFAR-10/100, and ImageNet data sets demonstrate that the proposed method can effectively reduce the generalization error for various network architectures, such as AlexNet, VGGNet, ResNet, Wide ResNet, and DenseNet, and outperforms existing model distillation methods with little extra training effort.
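The two consistency terms described in the abstract (MMD on global features, KL divergence on class posteriors of distorted branches) can be sketched as follows. This is a minimal NumPy illustration under stated assumptions, not the authors' implementation: the function names, the RBF kernel choice for the MMD estimate, the symmetric form of the KL term, and the weights `lam` and `mu` are all assumptions for illustration.

```python
import numpy as np

def rbf_mmd2(x, y, sigma=1.0):
    """Biased estimate of squared maximum mean discrepancy between
    two feature batches x, y (shape [n, d]) with an RBF kernel.
    Kernel choice is an assumption; the paper's exact kernel may differ."""
    def gram(a, b):
        d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
        return np.exp(-d2 / (2.0 * sigma ** 2))
    return gram(x, x).mean() + gram(y, y).mean() - 2.0 * gram(x, y).mean()

def softmax(z):
    e = np.exp(z - z.max(axis=1, keepdims=True))  # numerically stable
    return e / e.sum(axis=1, keepdims=True)

def kl_div(p, q, eps=1e-12):
    """Mean KL(p || q) over the batch, with eps for numerical safety."""
    return (p * (np.log(p + eps) - np.log(q + eps))).sum(axis=1).mean()

def sd_consistency_loss(feat_a, feat_b, logits_a, logits_b, lam=1.0, mu=1.0):
    """Consistency loss between two distorted branches of the same samples:
    MMD aligns global features; a symmetrized KL term (an assumption)
    aligns the posterior class probabilities."""
    p, q = softmax(logits_a), softmax(logits_b)
    return lam * rbf_mmd2(feat_a, feat_b) + mu * 0.5 * (kl_div(p, q) + kl_div(q, p))
```

In training, this consistency term would be added to the usual cross-entropy loss of each branch; identical branch outputs drive the term to zero, so it penalizes only representation disagreement between distorted views.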