PLD: A Choice-Theoretic List-Wise Knowledge Distillation
Journal: arXiv
Published Date: Jun 14, 2025
Abstract
Knowledge distillation is a model compression technique in which a compact
"student" network is trained to replicate the predictive behavior of a larger
"teacher" network. In logit-based knowledge distillation, the de facto
approach is to augment cross-entropy with a distillation term. Typically,
this term is either a KL divergence that matches marginal probabilities or a
correlation-based loss that captures intra- and inter-class relationships, but
in every case it sits as an add-on to cross-entropy with its own weight that
must be carefully tuned. In this paper, we adopt a choice-theoretic perspective
and recast knowledge distillation under the Plackett-Luce model by interpreting
teacher logits as "worth" scores. We introduce Plackett-Luce Distillation
(PLD), a weighted list-wise ranking loss in which the teacher model transfers
knowledge of its full ranking of classes, weighting each ranked choice by its
own confidence. PLD directly optimizes a single teacher-optimal ranking, with
the true label first followed by the remaining classes in descending teacher
confidence, yielding a convex, translation-invariant surrogate that subsumes
weighted cross-entropy. Empirically, on standard image classification
benchmarks, PLD improves Top-1 accuracy by an average of +0.42% over DIST
(arXiv:2205.10536) and +1.04% over KD (arXiv:1503.02531) in homogeneous
settings and by +0.48% and +1.09% over DIST and KD, respectively, in
heterogeneous settings.
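
To make the loss described above concrete, the following is a minimal sketch of one reading of it for a single example, not the authors' implementation. It builds the teacher-optimal ranking (true label first, remaining classes by descending teacher logit) and sums the Plackett-Luce negative log-likelihood of each ranked choice under the student logits. The function name pld_loss is hypothetical, and weighting each position by the teacher's softmax probability for that class is an assumption; the abstract only says each choice is weighted by the teacher's confidence.

```python
import math


def pld_loss(student_logits, teacher_logits, label):
    """Weighted Plackett-Luce negative log-likelihood for one example (sketch)."""
    n = len(teacher_logits)

    # Teacher-optimal ranking: true label first, then the remaining classes
    # in descending order of teacher logit.
    rest = sorted((c for c in range(n) if c != label),
                  key=lambda c: teacher_logits[c], reverse=True)
    order = [label] + rest

    # ASSUMPTION: the weight of each position is the teacher's softmax
    # probability of the class placed there.
    t_max = max(teacher_logits)
    exp_t = [math.exp(t - t_max) for t in teacher_logits]
    z = sum(exp_t)
    weights = [exp_t[c] / z for c in order]

    # Plackett-Luce: position k picks order[k] among the classes not yet
    # ranked, so its negative log-likelihood under the student is
    # logsumexp(s[order[k:]]) - s[order[k]]. Walk the ranking backwards and
    # maintain a running log-sum-exp over the suffix.
    loss = 0.0
    suffix_lse = -math.inf
    for k in range(n - 1, -1, -1):
        s_k = student_logits[order[k]]
        hi = max(suffix_lse, s_k)
        suffix_lse = hi + math.log(math.exp(suffix_lse - hi) + math.exp(s_k - hi))
        loss += weights[k] * (suffix_lse - s_k)
    return loss
```

In this sketch the first ranked choice contributes the teacher-weighted cross-entropy on the true label, which is consistent with the claim that the surrogate subsumes weighted cross-entropy. Adding a constant to all student logits leaves every term unchanged, which matches the stated translation invariance.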