Sorting the Babble in Babel: Assessing the Performance of Language Detection Algorithms on the OpenAlex Database
Journal:
arXiv
Published Date:
Feb 5, 2025
Abstract
This project compares various language classification procedures, each
combining a Python language detection algorithm with a metadata-based corpus
extracted from manually annotated articles sampled from
the OpenAlex database. Following an analysis of precision and recall
performance for each algorithm, corpus, and language as well as of processing
speeds recorded for each algorithm and corpus type, overall procedure
performance at the database level was simulated using probabilistic confusion
matrices for each algorithm, corpus, and language as well as a probabilistic
model of relative article language frequencies for the whole OpenAlex database.
Results show that procedure performance strongly depends on the importance
given to each of the measures implemented: for contexts where precision is
preferred, using the LangID algorithm on the Greedy corpus gives the best
results; however, for all cases where recall is considered at least slightly
more important than precision or as soon as processing times are given any kind
of consideration, the procedure combining the FastSpell algorithm and the
Titles corpus outperforms all other alternatives. Given the lack of truly
multilingual, large-scale bibliographic databases, it is hoped that these
results help confirm and foster the unparalleled potential of the OpenAlex
database for cross-linguistic, bibliometric-based research and analysis.
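
The database-level simulation described above can be illustrated with a minimal sketch. It assumes that each confusion matrix is row-normalized, so that entry (i, j) is the probability that an article truly in language i is labeled as language j, and that language frequencies are given as a prior probability vector. The function names, the two-language setup, and every number below are hypothetical placeholders chosen for illustration; they are not taken from the paper, and the authors' actual procedure may differ. An F-beta score is used here as one plausible way to weight recall against precision.

```python
import numpy as np


def simulate_database_performance(confusion, priors):
    """Estimate per-language precision and recall at the database level.

    confusion[i, j] is assumed to be P(predicted = j | true = i).
    priors[i] is assumed to be P(true = i) over the whole database.
    """
    confusion = np.asarray(confusion, dtype=float)
    priors = np.asarray(priors, dtype=float)

    # Joint probability of (true language i, predicted language j).
    joint = priors[:, None] * confusion

    # Total probability mass assigned to each predicted language.
    predicted_mass = joint.sum(axis=0)
    true_positive = np.diag(joint)

    # Recall per language is the diagonal of the row-normalized matrix.
    recall = np.diag(confusion).copy()

    # Precision depends on the priors: rare languages are penalized by
    # false positives leaking in from frequent ones.
    with np.errstate(divide="ignore", invalid="ignore"):
        precision = np.where(predicted_mass > 0,
                             true_positive / predicted_mass, 0.0)
    return precision, recall


def f_beta(precision, recall, beta=1.0):
    """F-beta score; beta > 1 weights recall more heavily than precision."""
    precision = np.asarray(precision, dtype=float)
    recall = np.asarray(recall, dtype=float)
    b2 = beta ** 2
    denom = b2 * precision + recall
    with np.errstate(divide="ignore", invalid="ignore"):
        return np.where(denom > 0, (1 + b2) * precision * recall / denom, 0.0)


# Hypothetical two-language example (illustrative values only).
languages = ["en", "fr"]
confusion = [[0.99, 0.01],   # true "en": mostly labeled "en"
             [0.10, 0.90]]   # true "fr": 10% mislabeled as "en"
priors = [0.9, 0.1]          # assumed share of each language in the database

precision, recall = simulate_database_performance(confusion, priors)
scores = f_beta(precision, recall, beta=2.0)
for lang, p, r, f in zip(languages, precision, recall, scores):
    print(f"{lang}: precision={p:.3f} recall={r:.3f} F2={f:.3f}")
```

In this sketch, recall is independent of the language frequencies, while precision for a minority language drops as the majority language grows more dominant, which is why a frequency model is needed to move from sample-level to database-level estimates.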