Article

Preparing lessons: Improve knowledge distillation with better supervision

NEUROCOMPUTING (2021)

Journal

NEUROCOMPUTING

Volume 454, Pages 25-33

Publisher

ELSEVIER

DOI: 10.1016/j.neucom.2021.04.102

Keywords

Knowledge distillation; Label regularization; Hard example mining

Category

Computer Science, Artificial Intelligence

Funding

NSFC [61732008, 61772407]
World-Class Universities (Disciplines) and the Characteristic Development Guidance Funds for the Central Universities [PY3A022]

Abstract

Knowledge distillation (KD) is a widely used method for training efficient neural networks, in which a compact model is trained to mimic the representation of a cumbersome model to achieve better performance. This paper introduces two novel approaches, Logits Adjustment (LA) and Dynamic Temperature Distillation (DTD), to deal with incorrect and uncertain logits, improving upon the standard KD approach.

Knowledge distillation (KD) is widely applied in the training of efficient neural networks. A compact model trained to mimic the representation of a cumbersome model for the same task generally performs better than one trained with the ground-truth labels. Previous KD-based works mainly focus on two aspects: (1) designing various feature representations for knowledge transfer; (2) introducing different training mechanisms such as progressive learning or adversarial learning. In this paper, we revisit the standard KD and observe that training with the teacher's logits might suffer from incorrect and uncertain supervision. To tackle these problems, we propose two novel approaches to deal with incorrect logits and uncertain logits respectively, called Logits Adjustment (LA) and Dynamic Temperature Distillation (DTD). Specifically, LA rectifies the incorrect logits according to the ground-truth label and certain rules, while DTD treats the temperature of KD as a dynamic, sample-wise parameter rather than a static, global hyper-parameter, so that it reflects the uncertainty of each sample's logits. By iteratively updating the sample-wise temperature, the student model can pay more attention to the samples that confuse the teacher model. Experiments on CIFAR-10/100, CINIC-10 and Tiny ImageNet verify that the proposed methods yield encouraging improvements over the standard KD. Furthermore, given their simple implementations, LA and DTD can be easily attached to many KD-based frameworks and bring improvements without extra cost in training time or computing resources. © 2021 Published by Elsevier B.V.
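The abstract does not spell out the LA rules or the DTD update, so the following is only a minimal sketch of the two ideas under stated assumptions, not the authors' implementation. It assumes that LA swaps the teacher's top logit with the ground-truth logit when the teacher is wrong, and that DTD shifts a base temperature by how far a per-sample weight lies from the batch mean. The function names, the weight input, and the parameters `base_t`, `beta`, and `min_t` are all hypothetical.

```python
import math


def log_softmax(logits, temperature=1.0):
    """Numerically stable log of the temperature-scaled softmax."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    log_total = m + math.log(sum(math.exp(s - m) for s in scaled))
    return [s - log_total for s in scaled]


def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax; higher temperature gives a flatter output."""
    return [math.exp(v) for v in log_softmax(logits, temperature)]


def adjust_logits(teacher_logits, label):
    """Assumed LA rule (hypothetical): if the teacher's top class is not the
    ground-truth class, swap those two logits; otherwise leave them as-is."""
    adjusted = list(teacher_logits)
    top = max(range(len(adjusted)), key=adjusted.__getitem__)
    if top != label:
        adjusted[top], adjusted[label] = adjusted[label], adjusted[top]
    return adjusted


def dynamic_temperatures(weights, base_t=4.0, beta=1.0, min_t=1.0):
    """Assumed DTD rule (hypothetical): samples whose weight is above the
    batch mean get a temperature below base_t, and vice versa."""
    mean_w = sum(weights) / len(weights)
    return [max(min_t, base_t + beta * (mean_w - w)) for w in weights]


def kd_loss(student_logits, teacher_logits, temperature):
    """Cross-entropy between the softened teacher and student distributions."""
    p = softmax(teacher_logits, temperature)
    log_q = log_softmax(student_logits, temperature)
    return -sum(pi * lqi for pi, lqi in zip(p, log_q))


# Usage on a toy batch of two samples with three classes each.
teacher_batch = [[2.0, 5.0, 1.0], [4.0, 1.0, 0.5]]
student_batch = [[1.0, 1.5, 0.2], [2.0, 0.3, 0.1]]
labels = [0, 0]
sample_weights = [0.8, 0.2]  # hypothetical per-sample confusion scores

temps = dynamic_temperatures(sample_weights, base_t=4.0, beta=2.0)
losses = [
    kd_loss(s, adjust_logits(t, y), temp)
    for s, t, y, temp in zip(student_batch, teacher_batch, labels, temps)
]
```

In this sketch the first sample, which the teacher misclassifies and which carries the larger weight, receives both a corrected target and a lower temperature than the second sample. How the per-sample weight is computed and updated during training is left open here.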


Semantic Segmentation Using Pixel-Wise Adaptive Label Smoothing via Self-Knowledge Distillation for Limited Labeling Data

Sangyong Park, Jaeseon Kim, Yong Seok Heo

Summary: This study proposes a new regularization method called PALS, which uses self-knowledge distillation to train semantic segmentation networks with limited training data. The method utilizes internal statistics of pixels to generate pixel-wise aggregated probability distributions for increased accuracy. Experimental results show that compared to previous methods, this approach achieves more accurate results with limited training data.

SENSORS (2022)