Structure-based Anomaly Detection and Clustering
Journal: arXiv
Published Date: May 19, 2025
Abstract
Anomaly detection is a fundamental problem in domains such as healthcare,
manufacturing, and cybersecurity. This thesis proposes new unsupervised methods
for anomaly detection in both structured and streaming data settings. In the
first part, we focus on structure-based anomaly detection, where normal data
follows low-dimensional manifolds while anomalies deviate from them. We
introduce Preference Isolation Forest (PIF), which embeds data into a
high-dimensional preference space via manifold fitting, and isolates outliers
using two variants: Voronoi-iForest, based on geometric distances, and
RuzHash-iForest, leveraging Locality Sensitive Hashing for scalability. We also
propose Sliding-PIF, which captures local manifold information for streaming
scenarios. Our methods outperform existing techniques on synthetic and real
datasets. We then extend the structure-based approach to clustering with
MultiLink, a novel method for recovering multiple geometric model families in
noisy data.
MultiLink merges clusters via a model-aware linkage strategy, enabling robust
multi-class structure recovery. It offers key advantages over existing
approaches, such as speed, reduced sensitivity to thresholds, and improved
robustness to poor initial sampling. The second part of the thesis addresses
online anomaly detection in evolving data streams. We propose Online Isolation
Forest (Online-iForest), which uses adaptive, multi-resolution histograms and
dynamically updates tree structures to track changes over time. It avoids
retraining while achieving accuracy comparable to offline models, with superior
efficiency for real-time applications. Finally, we tackle anomaly detection in
cybersecurity via open-set recognition for malware classification. We enhance a
Gradient Boosting classifier with MaxLogit to detect unseen malware families, a
method now integrated into Cleafy's production system.
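
The MaxLogit idea mentioned at the end of the abstract can be illustrated with a minimal sketch. The abstract does not describe how the thesis or Cleafy's system implements it, so everything below is an assumption made for illustration: the family names, the raw scores, the threshold value, and the helper names `max_logit` and `predict_open_set` are all hypothetical. The sketch only shows the general rule: take the largest raw (pre-softmax) class score and treat the sample as an unseen class when that score falls below a chosen threshold.

```python
from typing import Sequence


def max_logit(logits: Sequence[float]) -> float:
    """Return the largest raw (pre-softmax) class score for one sample."""
    return max(logits)


def predict_open_set(
    logits: Sequence[float],
    class_names: Sequence[str],
    threshold: float,
    unknown_label: str = "unknown",
) -> str:
    """Return the arg-max class, or `unknown_label` if the max logit is below `threshold`."""
    if len(logits) != len(class_names):
        raise ValueError("logits and class_names must have the same length")
    # Index of the highest-scoring known class.
    best = max(range(len(logits)), key=lambda i: logits[i])
    # A low maximum score means no known class explains the sample well.
    if logits[best] < threshold:
        return unknown_label
    return class_names[best]


# Hypothetical raw scores for three known families (illustrative values only).
FAMILIES = ["family_a", "family_b", "family_c"]
THRESHOLD = 1.0  # assumed; in practice it would be tuned on held-out data

confident_sample = [4.2, -1.0, 0.3]
ambiguous_sample = [0.4, 0.1, -0.2]

print(predict_open_set(confident_sample, FAMILIES, THRESHOLD))
print(predict_open_set(ambiguous_sample, FAMILIES, THRESHOLD))
```

With a multi-class gradient boosting model, the per-class raw scores would play the role of `logits` here. How the threshold is chosen, and which scores are actually used, depends on details the abstract does not give.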