K-order Ranking Preference Optimization for Large Language Models
Journal: arXiv
Published Date: May 31, 2025
Abstract
To adapt large language models (LLMs) to ranking tasks, existing list-wise
methods, represented by list-wise Direct Preference Optimization (DPO), focus
on optimizing partial-order or full-order list ranking consistency for LLMs to
enhance their ranking abilities. However, we argue that optimizing top-K
ranking consistency could be more appropriate for real-world applications.
There are two main reasons: (1) users are typically concerned with only the
top-K results, making top-K ranking more important, and (2) tail items often
lack precise feedback, making top-K ranking more reliable. Based on this, we
propose K-order Ranking Preference Optimization (KPO) by extending DPO's
Plackett-Luce model to accommodate top-K rankings. Additionally, recognizing
that the number of important items can vary across queries, we extend KPO to
dynamically determine an appropriate K for each sample and introduce a
curriculum learning strategy to boost training efficiency. Extensive
experiments demonstrate the effectiveness of KPO, highlighting its high sample
efficiency and robustness to noise. The code is available at
https://github.com/Lanyu0303/KPO.
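
To make the top-K idea concrete, the following is a minimal sketch of a Plackett-Luce negative log-likelihood truncated to the first K positions, with DPO-style implicit rewards. It is an illustration under stated assumptions, not the authors' implementation: the function names (`implicit_rewards`, `top_k_pl_loss`), the `beta` default, and the exact form of the objective are hypothetical, and the released code may differ. The sketch assumes candidates are already sorted best-first and that only the order of the first K items is modeled, while tail items appear in the normalizer but are never ranked among themselves.

```python
import math


def implicit_rewards(policy_logps, ref_logps, beta=0.1):
    """DPO-style implicit reward: beta * (log pi_theta - log pi_ref) per candidate.

    Assumption: inputs are sequence log-probabilities of each candidate
    under the policy and the frozen reference model.
    """
    return [beta * (p - r) for p, r in zip(policy_logps, ref_logps)]


def logsumexp(values):
    """Numerically stable log(sum(exp(v))) over a non-empty list."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def top_k_pl_loss(rewards, k):
    """Negative log-likelihood of the top-k ordering under Plackett-Luce.

    `rewards` must be ordered best-first. Position i (for i < k) competes
    against every item ranked at or below it, including the unranked tail.
    """
    n = len(rewards)
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= len(rewards)")
    nll = 0.0
    for i in range(k):
        # log P(item i is chosen first among the remaining items i..n-1)
        nll -= rewards[i] - logsumexp(rewards[i:])
    return nll


# Usage: four candidates, only the order of the top two is supervised.
rewards = implicit_rewards(
    policy_logps=[-1.0, -1.4, -2.0, -2.2],
    ref_logps=[-1.5, -1.5, -1.5, -1.5],
)
loss = top_k_pl_loss(rewards, k=2)
```

Under these assumptions, two candidates with K = 1 recover the pairwise DPO loss, and K = N - 1 or K = N gives the full-order list-wise loss, so K interpolates between the two regimes.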