What is in a name? Mitigating Name Bias in Text Embeddings via Anonymization
Journal:
arXiv
Published Date:
Feb 5, 2025
Abstract
Text-embedding models often exhibit biases arising from the data on which
they are trained. In this paper, we examine a hitherto unexplored bias in
text-embeddings: bias arising from the presence of $\textit{names}$, such as those of
persons, locations, and organizations, in the text. Our study shows how the
presence of $\textit{name-bias}$ in text-embedding models can potentially lead
to erroneous conclusions in the assessment of thematic similarity. Text-embeddings
can mistakenly indicate similarity between texts based on the names they contain,
even when their actual semantic content has no similarity, or indicate
dissimilarity simply because of those names, even when the texts match
semantically. We first demonstrate the presence of name bias in different
text-embedding models and then propose $\textit{text-anonymization}$ during
inference, which involves removing references to names while preserving the
core theme of the text. The efficacy of the anonymization approach is
demonstrated on two downstream NLP tasks, achieving significant performance
gains. Our simple approach requires no training or optimization and offers a
practical, easily implementable solution to mitigate name bias.
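
The anonymize-then-compare idea described above can be illustrated with a minimal sketch. This is not the paper's implementation: the abstract does not say how names are detected or which embedding models are used. The name list `NAMES`, the helper `anonymize`, and the bag-of-words `bow_cosine` function (a toy stand-in for a real text-embedding model) are all assumptions made for illustration.

```python
import math
import re
from collections import Counter

# Hypothetical name list for illustration only; a real pipeline would
# more likely obtain name spans from a named-entity recognizer.
NAMES = ["Alice", "Paris", "Acme"]


def anonymize(text, names=NAMES):
    """Remove whole-word occurrences of the listed names, then tidy whitespace."""
    pattern = re.compile(r"\b(?:" + "|".join(re.escape(n) for n in names) + r")\b")
    stripped = pattern.sub("", text)
    return re.sub(r"\s+", " ", stripped).strip()


def bow_cosine(a, b):
    """Cosine similarity of bag-of-words counts (toy stand-in for an embedding model)."""
    va = Counter(re.findall(r"\w+", a.lower()))
    vb = Counter(re.findall(r"\w+", b.lower()))
    dot = sum(va[t] * vb[t] for t in va)
    norm = math.sqrt(sum(c * c for c in va.values())) * math.sqrt(
        sum(c * c for c in vb.values())
    )
    return dot / norm if norm else 0.0


# Two texts that share a name but describe different events.
text_a = "Alice visited Paris"
text_b = "Alice founded Acme"

raw_score = bow_cosine(text_a, text_b)
anon_score = bow_cosine(anonymize(text_a), anonymize(text_b))
print(raw_score, anon_score)
```

In this toy setting the shared name is the only overlap between the two texts, so the similarity score is positive before anonymization and drops to zero afterwards. A real embedding model would behave less cleanly, but the inference-time flow is the same: anonymize each text first, then embed and compare.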