Swapped Logit Distillation via Bi-level Teacher Alignment
Journal:
arXiv
Published Date:
Apr 27, 2025
Abstract
Knowledge distillation (KD) compresses the network capacity by transferring
knowledge from a large (teacher) network to a smaller one (student). It has
been mainstream that the teacher directly transfers knowledge to the student
with its original distribution, which can possibly lead to incorrect
predictions. In this article, we propose a logit-based distillation via swapped
logit processing, namely Swapped Logit Distillation (SLD). SLD is proposed
under two assumptions: (1) the wrong prediction occurs when the prediction
label confidence is not the maximum; (2) the "natural" limit of probability
remains uncertain as the best value addition to the target cannot be
determined. To address these issues, we propose a swapped logit processing
scheme. Through this approach, we find that the swap method can be effectively
extended to teacher and student outputs, transforming into two teachers. We
further introduce loss scheduling to boost the performance of two teachers'
alignment. Extensive experiments on image classification tasks demonstrate that
SLD consistently performs best among previous state-of-the-art methods.