Ranking-based Fusion Algorithms for Extreme Multi-label Text Classification (XMTC)

Journal: arXiv

Published Date: Jul 4, 2025

Abstract

In the context of Extreme Multi-label Text Classification (XMTC), where labels are assigned to text instances from a large label space, the long-tail distribution of labels presents a significant challenge. Labels can be broadly categorized into frequent, high-coverage \textbf{head labels} and infrequent, low-coverage \textbf{tail labels}, complicating the task of balancing effectiveness across all labels. To address this, combining predictions from multiple retrieval methods, such as sparse retrievers (e.g., BM25) and dense retrievers (e.g., fine-tuned BERT), offers a promising solution. The fusion of \textit{sparse} and \textit{dense} retrievers is motivated by the complementary ranking characteristics of these methods. Sparse retrievers compute relevance scores based on high-dimensional, bag-of-words representations, while dense retrievers utilize approximate nearest neighbor (ANN) algorithms on dense text and label embeddings within a shared embedding space. Rank-based fusion algorithms leverage these differences by combining the precise matching capabilities of sparse retrievers with the semantic richness of dense retrievers, thereby producing a final ranking that improves the effectiveness across both head and tail labels.

Authors

Celso França
Gestefane Rabbi
Thiago Salles
Washington Cunha
Leonardo Rocha
Marcos André Gonçalves

External Resources

View on arXiv arXiv (http://arxiv.org/abs/2507.03761v1)

Ranking-based Fusion Algorithms for Extreme Multi-label Text Classification (XMTC)

Abstract

Authors

Categories

External Resources

Popular Topics

Recent Journals

Ranking-based Fusion Algorithms for Extreme Multi-label Text Classification (XMTC)

Abstract

Authors

Categories

External Resources

Don't Miss the Future of Medicine

Popular Topics

Recent Journals