End-to-End triplet loss based fine-tuning for network embedding in effective PII detection
Journal: arXiv
Published Date: Feb 13, 2025
Abstract
Many approaches in the mobile data ecosystem inspect network traffic generated
by applications running on a user's device to detect personal data exfiltration
from that device. State-of-the-art methods rely on features extracted from HTTP
requests: classifiers are trained on these features using labelled packet
traces and then used to make predictions. However, most of these methods
include external feature
selection before model training. Deep learning, on the other hand, typically
does not require such techniques, as it can autonomously learn and identify
patterns in the data without external feature extraction or selection
algorithms. In this article, we propose a novel deep-learning-based end-to-end
learning framework for predicting the exposure of personally identifiable
information (PII) in mobile packets. The framework employs a pre-trained large
language model (LLM) and an autoencoder to generate embeddings of network
packets, and then uses a triplet-loss-based fine-tuning method to train the
model, improving detection effectiveness on two real-world datasets. We
compare our proposed detection framework with other state-of-the-art work on
detecting PII leaks from a user's device.
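
As an illustration of the triplet-loss idea mentioned above, the following is a minimal sketch of the standard margin-based triplet loss on toy vectors. The abstract does not specify the distance function, margin, or embedding dimension, so the Euclidean distance, the margin of 1.0, and the two-dimensional example vectors are assumptions made purely for illustration, and the function names are hypothetical rather than taken from the paper.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def triplet_loss(anchor, positive, negative, margin=1.0):
    """Standard triplet margin loss: max(0, d(a, p) - d(a, n) + margin).

    The loss is zero once the negative is at least `margin` farther
    from the anchor than the positive is. The margin value and the
    distance function are illustrative assumptions.
    """
    d_pos = euclidean(anchor, positive)
    d_neg = euclidean(anchor, negative)
    return max(0.0, d_pos - d_neg + margin)


# Toy stand-ins for packet embeddings (hypothetical values).
anchor = [0.0, 0.0]
positive = [0.0, 1.0]   # same class as the anchor, distance 1
negative = [3.0, 4.0]   # different class, distance 5

# Well-separated triplet: no penalty.
easy = triplet_loss(anchor, positive, negative)
# Swapped roles: the "positive" is far and the "negative" is close.
hard = triplet_loss(anchor, negative, positive)
```

In a fine-tuning setting, a loss of this shape would be minimized over many such triplets so that embeddings of packets with the same label move closer together than embeddings of packets with different labels.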