Avoiding Overfitting in Variable-Order Markov Models: a Cross-Validation Approach
Journal:
arXiv
Published Date:
Jan 24, 2025
Abstract
Higher$\text{-}$order Markov chain models are widely used to represent agent
transitions in dynamic systems, such as passengers in transport networks. They
capture transitions in complex systems by considering not only the current
state but also the path of previously visited states. For example, the
likelihood of train passengers traveling from Paris (current state) to Rome
could increase significantly if their journey originated in Italy (prior
state). Although this approach provides a more faithful representation of the
system than first$\text{-}$order models, we find that commonly used
methods$-$relying on Kullback$\text{-}$Leibler divergence$-$frequently overfit
the data, mistaking fluctuations for higher$\text{-}$order dependencies and
undermining forecasts and resource allocation. Here, we introduce DIVOP
(Detection of Informative Variable$\text{-}$Order Paths), an algorithm that
employs cross$\text{-}$validation to robustly distinguish meaningful
higher$\text{-}$order dependencies from noise. In both synthetic and
real$\text{-}$world datasets, DIVOP outperforms two
state$\text{-}$of$\text{-}$the$\text{-}$art algorithms by achieving higher
precision, recall, and sparser representations of the underlying dynamics. When
applied to global corporate ownership data, DIVOP reveals that tax havens
appear in 82$\%$ of all significant higher$\text{-}$order dependencies,
underscoring their outsized influence in corporate networks. By mitigating
overfitting, DIVOP enables more reliable multi$\text{-}$step predictions and
decision$\text{-}$making, paving the way toward deeper insights into the hidden
structures that drive modern interconnected systems.