Simple cumulative weighting of routine surveillance data identifies epidemic wave origins more accurately than a large language model: evidence from eight COVID-19 waves in Japan
Journal: medRxiv
Published Date: Jun 3, 2026
Abstract
Identifying the origin of an emerging epidemic wave within days of onset could enable targeted response before national spread, yet current methods rely on genomic sequencing that lags clinical detection by 2-4 weeks. We analysed daily COVID-19 cases from Japan's 47 prefectures across eight waves (2020-2023), aggregated into 11 regional blocks. Wave onset was defined by the first difference of the K-value (K'). Six surveillance indicators were evaluated with and without cumulative historical weighting (λ = 0.75) and benchmarked against a large language model (Claude Haiku), scored by F1 against genomically confirmed origins. At 14 days after onset, cumulative weighting of peak and cumulative incidence (B1+prior, B3+prior) reached mean F1 = 0.622, exceeding the model (0.524); the gap was largest in Wave 7 (1.000 vs 0.333). Simple cumulative weighting of routine surveillance data identified wave origins more accurately than a language model, without proprietary tools or sequencing.
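
The scoring and weighting steps described in the abstract can be illustrated with a minimal Python sketch. The abstract does not give the exact form of the cumulative weighting, so the recursion used here (prior = λ × (previous prior + previous wave's score), combined score = current indicator + prior) is an assumption made for illustration, as are the function names, the block labels "A" and "B", and the toy scores. Only λ = 0.75 and the use of F1 against confirmed origins come from the text.

```python
from typing import Dict, List, Set


def f1_score(predicted: Set[str], confirmed: Set[str]) -> float:
    """Set-based F1 between predicted and confirmed origin blocks."""
    true_positives = len(predicted & confirmed)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(confirmed)
    return 2 * precision * recall / (precision + recall)


def cumulative_prior(history: List[Dict[str, float]], lam: float = 0.75) -> Dict[str, float]:
    """Discounted sum of per-block scores from earlier waves.

    ASSUMPTION: the recursion prior <- lam * (prior + score) is a
    hypothetical form, not necessarily the one used in the paper.
    `history` is ordered from the oldest wave to the most recent one.
    """
    prior: Dict[str, float] = {}
    for wave_scores in history:
        blocks = set(prior) | set(wave_scores)
        prior = {b: lam * (prior.get(b, 0.0) + wave_scores.get(b, 0.0)) for b in blocks}
    return prior


def combine(current: Dict[str, float], prior: Dict[str, float]) -> Dict[str, float]:
    """Add the historical prior to the current wave's indicator (assumed additive)."""
    blocks = set(current) | set(prior)
    return {b: current.get(b, 0.0) + prior.get(b, 0.0) for b in blocks}


def top_k(scores: Dict[str, float], k: int = 1) -> Set[str]:
    """Pick the k highest-scoring blocks; ties are broken alphabetically."""
    ranked = sorted(scores, key=lambda b: (-scores[b], b))
    return set(ranked[:k])


# Toy data: two earlier waves and a current wave with a tied indicator.
history = [{"A": 1.0, "B": 0.0}, {"A": 0.0, "B": 1.0}]
current = {"A": 0.5, "B": 0.5}

prior = cumulative_prior(history, lam=0.75)
combined = combine(current, prior)
predicted = top_k(combined, k=1)
score = f1_score(predicted, confirmed={"B"})
```

In this toy case the current indicator alone cannot separate the two blocks, and the prior tips the prediction toward the block that scored highest in the most recent earlier wave. Averaging such per-wave F1 values across waves would give a summary comparable in kind to the mean F1 reported above, under the stated assumptions.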