The sociolinguistic foundations of language modeling.

Journal: Frontiers in Artificial Intelligence

Abstract

In this article, we introduce a sociolinguistic perspective on language modeling. We claim that language models in general are inherently modeling varieties of language, and we consider how this insight can inform the development and deployment of language models. We begin by presenting a technical definition of the concept of a variety of language as developed in sociolinguistics. We then discuss how this perspective could help us better understand five basic challenges in language modeling. We argue that, to maximize the performance and societal value of language models, it is important to carefully compile training corpora that accurately represent the specific varieties of language being modeled, drawing on theories, methods, and descriptions from the field of sociolinguistics.

Authors

  • Jack Grieve
    Department of Linguistics and Communication, University of Birmingham, Birmingham, United Kingdom.
  • Sara Bartl
    Department of Linguistics and Communication, University of Birmingham, Birmingham, United Kingdom.
  • Matteo Fuoli
    Department of Linguistics and Communication, University of Birmingham, Birmingham, United Kingdom.
  • Jason Grafmiller
    Department of Linguistics and Communication, University of Birmingham, Birmingham, United Kingdom.
  • Weihang Huang
    Department of Linguistics and Communication, University of Birmingham, Birmingham, United Kingdom.
  • Alejandro Jawerbaum
    Department of Linguistics and Communication, University of Birmingham, Birmingham, United Kingdom.
  • Akira Murakami
    Department of Linguistics and Communication, University of Birmingham, Birmingham, United Kingdom.
  • Marcus Perlman
    Department of Linguistics and Communication, University of Birmingham, Birmingham, United Kingdom.
  • Dana Roemling
    Department of Linguistics and Communication, University of Birmingham, Birmingham, United Kingdom.
  • Bodo Winter
    Department of Linguistics and Communication, University of Birmingham, Birmingham, United Kingdom.
