Personalized online ensemble machine learning with applications for dynamic data streams.

Journal: Statistics in Medicine
Published Date:

Abstract

In this work, we introduce the personalized online super learner (POSL), an online, personalizable ensemble machine learning algorithm for streaming data. POSL optimizes predictions with respect to baseline covariates, so personalization can vary from completely individualized, that is, optimization with respect to subject ID, to many individuals, that is, optimization with respect to common baseline covariates. As an online algorithm, POSL learns in real time. As a super learner, POSL is grounded in statistical optimality theory and can leverage a diversity of candidate algorithms, including online algorithms with different training and update times, fixed/offline algorithms that are not updated during POSL's fitting procedure, pooled algorithms that learn from many individuals' time series, and individualized algorithms that learn from within a single time series. POSL's ensembling of the candidates can depend on the amount of data collected, the stationarity of the time series, and the shared characteristics of a group of time series. Depending on the underlying data-generating process and the information available in the data, POSL is able to adapt to learning across samples, through time, or both. For a range of simulations that reflect realistic forecasting scenarios, and in a medical application, we examine the performance of POSL relative to other current ensembling and online learning methods. We show that POSL is able to provide reliable predictions for both short and long time series, and that it is able to adjust to changing data-generating environments. We further enhance POSL's practicality by extending it to settings where time series dynamically enter and exit.
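
The idea of choosing among pooled, individualized, online, and fixed candidates separately for each subject can be illustrated with a small Python sketch. This is not the authors' implementation: it is a simplified selector that, for each subject ID, picks the single candidate with the lowest cumulative one-step-ahead squared error, and it assumes toy candidates (a running mean, a last-value forecaster, and a constant). All class and method names here are hypothetical, and the grouping key could just as well be a set of shared baseline covariates instead of a subject ID.

```python
from collections import defaultdict


class RunningMean:
    """Online candidate: mean of every outcome it has been shown."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0

    def predict(self):
        return self.mean

    def update(self, y):
        self.n += 1
        self.mean += (y - self.mean) / self.n


class LastValue:
    """Online candidate: repeats the most recent outcome."""

    def __init__(self):
        self.last = 0.0

    def predict(self):
        return self.last

    def update(self, y):
        self.last = y


class Fixed:
    """Offline candidate: a constant that is never updated."""

    def __init__(self, value):
        self.value = value

    def predict(self):
        return self.value

    def update(self, y):
        pass


class PersonalizedSelector:
    """Toy per-subject selector (an assumption, not the POSL algorithm itself)."""

    def __init__(self):
        self.pooled = RunningMean()  # learns from all subjects' series
        self.fixed = Fixed(0.0)      # stays as-is during fitting
        # Individualized candidates: one set per subject ID.
        self.individual = defaultdict(
            lambda: {"ind_mean": RunningMean(), "last": LastValue()}
        )
        # Cumulative one-step-ahead squared error, per subject and candidate.
        self.loss = defaultdict(lambda: defaultdict(float))

    def _candidates(self, sid):
        cands = dict(self.individual[sid])
        cands["pooled_mean"] = self.pooled
        cands["fixed"] = self.fixed
        return cands

    def predict(self, sid):
        """Return (name, forecast) of the candidate with the lowest loss so far."""
        cands = self._candidates(sid)
        best = min(cands, key=lambda name: self.loss[sid][name])
        return best, cands[best].predict()

    def update(self, sid, y):
        """Score every candidate on the new outcome, then let each one learn."""
        cands = self._candidates(sid)
        for name, cand in cands.items():
            self.loss[sid][name] += (y - cand.predict()) ** 2
        for cand in cands.values():
            cand.update(y)
```

Because each candidate is scored on an outcome before it is allowed to learn from it, the accumulated loss is an out-of-sample measure. Under this sketch, a subject with a trending series would tend to favor the last-value candidate, while a subject with a stationary series would tend to favor a mean-based one.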

Authors

  • Ivana Malenica
    Division of Biostatistics, School of Public Health, University of California, Berkeley, CA, USA.
  • Rachael V Phillips
    Divisions of Epidemiology & Biostatistics, University of California, Berkeley, Berkeley, California, USA.
  • Antoine Chambaz
    MAP5 (UMR CNRS 8145), Université Paris Descartes, 75006 Paris, France.
  • Alan E Hubbard
  • Romain Pirracchio
  • Mark J van der Laan
    Division of Biostatistics, School of Public Health, University of California, Berkeley, CA, USA.