Variable selection in social-environmental data: sparse regression and tree ensemble machine learning approaches.

Journal: BMC Medical Research Methodology
Published Date:

Abstract

BACKGROUND: Social-environmental data obtained from the US Census is an important resource for understanding health disparities, but the full dataset is rarely used in analysis. One barrier to incorporating the full data is the lack of solid recommendations for variable selection, so researchers often hand-select a few variables. We therefore evaluated the ability of empirical machine learning approaches to identify social-environmental factors that have a true association with a health outcome.

Authors

  • Elizabeth Handorf
    Biostatistics and Bioinformatics Facility, Fox Chase Cancer Center, Reimann 383, 333 Cottman Ave, Philadelphia, PA, 19111, USA. Elizabeth.Handorf@fccc.edu.
  • Yinuo Yin
    Cancer Prevention and Control, Fox Chase Cancer Center, Young Pavilion, 333 Cottman Ave, Philadelphia, PA, 19111, USA.
  • Michael Slifker
    Biostatistics and Bioinformatics Facility, Fox Chase Cancer Center, Reimann 383, 333 Cottman Ave, Philadelphia, PA, 19111, USA.
  • Shannon Lynch
    Cancer Prevention and Control, Fox Chase Cancer Center, Young Pavilion, 333 Cottman Ave, Philadelphia, PA, 19111, USA.