Data and Model Biases in Social Media Analyses: A Case Study of COVID-19 Tweets.

Journal: AMIA ... Annual Symposium proceedings. AMIA Symposium

Abstract

During the coronavirus disease 2019 (COVID-19) pandemic, social media platforms such as Twitter have become a venue for individuals, health professionals, and government agencies to share COVID-19 information. Twitter has been a popular source of data for researchers, especially for public health studies. However, the use of Twitter data for research also has drawbacks and barriers. Biases appear everywhere from data collection methods to modeling approaches, and those biases have not been systematically assessed. In this study, we examined six different data collection methods and three different machine learning (ML) models, commonly used in social media analysis, to assess data collection bias and measure the ML models' sensitivity to that bias. We showed that (1) publicly available Twitter data collection endpoints, used with appropriate strategies, can collect data that are reasonably representative of the Twitter universe; and (2) careful examination of ML models' sensitivity to data collection bias is critical.

Authors

  • Yunpeng Zhao
    University of Florida, Gainesville, Florida, USA.
  • Pengfei Yin
    University of Florida, Gainesville, Florida, USA.
  • Yongqiu Li
    University of Florida, Gainesville, Florida, USA.
  • Xing He
    University of Florida, Gainesville, Florida, USA.
  • Jingcheng Du
    University of Texas Health Science Center at Houston, Houston, Texas, USA.
  • Cui Tao
    The University of Texas Health Science Center at Houston, USA.
  • Yi Guo
    Department of Respiratory and Critical Care Medicine, Ruijin Hospital, Shanghai Jiao Tong University School of Medicine, Shanghai 200025, China.
  • Mattia Prosperi
    University of Florida, Gainesville, Florida, USA.
  • Pierangelo Veltri
    Magna Graecia University of Catanzaro, Catanzaro, Italy.
  • Xi Yang
    Department of Health Outcomes and Biomedical Informatics.
  • Yonghui Wu
    Department of Health Outcomes and Biomedical Informatics.
  • Jiang Bian
    Department of Health Outcomes and Biomedical Informatics, College of Medicine, University of Florida, Gainesville, Florida, United States of America.