Contribution of Structure Learning Algorithms in Social Epidemiology: Application to Real-World Data.

Journal: International journal of environmental research and public health
PMID:

Abstract

Epidemiologists often handle large datasets with numerous variables and are currently seeing a growing wealth of techniques for data analysis, such as machine learning. Critical aspects involve addressing causality, often based on observational data, and dealing with the complex relationships between variables to uncover the overall structure of variable interactions, causal or not. Structure learning (SL) methods aim to automatically or semi-automatically reveal the structure of variables' relationships. The objective of this study is to delineate some of the potential contributions and limitations of structure learning methods when applied to social epidemiology topics and the search for determinants of healthcare system access. We applied SL techniques to a real-world dataset, namely the 2010 wave of the SIRS cohort, which included a sample of 3006 adults from the Paris region, France. Healthcare utilization, encompassing both direct and indirect access to care, was the primary outcome. Candidate determinants included health status, demographic characteristics, and socio-cultural and economic positions. We present two approaches: a non-automated epidemiological method (an initial expert knowledge network and stepwise logistic regression models) and three SL techniques using various algorithms, with and without knowledge constraints. We compared the results based on the presence, direction, and strength of specific links within the produced network. Although the interdependencies and relative strengths identified by both approaches were similar, the SL algorithms detected fewer associations with the outcome than the non-automated method. Relationships between variables were sometimes incorrectly oriented when using a purely data-driven approach. SL algorithms can be valuable in exploratory stages, helping to generate new hypotheses or to mine novel databases. However, results should be validated against prior knowledge and supplemented with additional confirmatory analyses.

Authors

  • Helene Colineaux
    EQUITY Team, Centre d'Epidémiologie et de Recherche en Santé des POPulations (CERPOP), Institut National de la Santé et de la Recherche Médicale (INSERM)-Toulouse III University, 37 Allées Jules Guesde, 31062 Toulouse, France.
  • Benoit Lepage
    Medical Information Department, University Hospital of Toulouse, 31059 Toulouse, France. Electronic address: lepage.b@chu-toulouse.fr.
  • Pierre Chauvin
    UMRS 1136, Pierre Louis Institute of Epidemiology and Public Health, Department of Social Epidemiology, Institut National de la Santé et de la Recherche Médicale (INSERM), Sorbonne University, 75005 Paris, France.
  • Chloe Dimeglio
    Toulouse Institute for Infectious and Inflammatory Diseases (INFINITY), Institut National de la Santé et de la Recherche Médicale (INSERM), UMR 1291, Centre National de la Recherche Scientifique (CNRS), UMR 5051, 31300 Toulouse, France.
  • Cyrille Delpierre
    EQUITY Team, Centre d'Epidémiologie et de Recherche en Santé des POPulations (CERPOP), Institut National de la Santé et de la Recherche Médicale (INSERM)-Toulouse III University, 37 Allées Jules Guesde, 31062 Toulouse, France.
  • Thomas Lefèvre
    Hôpital Jean-Verdier (AP-HP), Department of Forensic Science and Medicine, F-93140 Bondy, France; IRIS - Institut de recherches interdisciplinaires sur les enjeux sociaux (UMR 8156-723), Bobigny, France. Electronic address: thomas.lefevre@univ-paris13.fr.