FLStore: Efficient Federated Learning Storage for non-training workloads
Journal: arXiv
Published Date: Mar 1, 2025
Abstract
Federated Learning (FL) is an approach for privacy-preserving Machine
Learning (ML), enabling model training across multiple clients without
centralized data collection. An aggregator server coordinates training,
aggregates model updates, and stores metadata across rounds. In addition to
training, a substantial part of FL systems consists of non-training workloads such
as scheduling, personalization, clustering, debugging, and incentivization.
Most existing systems rely on the aggregator to handle non-training workloads
and use cloud services for data storage. This results in high latency and
increased costs, because non-training workloads rely on large volumes of
metadata, including weight parameters from client updates, hyperparameters, and
aggregated updates across rounds. We propose
FLStore, a serverless framework for efficient FL non-training workloads and
storage. FLStore unifies the data and compute planes on a serverless cache,
enabling locality-aware execution via tailored caching policies to reduce
latency and costs. In our evaluations, compared to a cloud object store-based
aggregator server, FLStore reduces average per-request latency by 71% and costs
by 92.45%, with peak improvements of 99.7% and 98.8%, respectively. Compared to
an in-memory cloud cache-based aggregator server, FLStore reduces average
latency by 64.6% and costs by 98.83%, with peak improvements of 98.8% and
99.6%, respectively. FLStore integrates seamlessly with existing FL frameworks
with minimal modifications, while also being fault-tolerant and highly
scalable.
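
The idea of running a non-training request next to the cached metadata it reads, rather than pulling that metadata from remote storage each time, can be illustrated with a minimal sketch. This is not FLStore's implementation: the class `ToyFunctionCache`, the workload `mean_l2_norm`, the `(round, client_id)` key layout, and the plain dictionary standing in for a cloud object store are all hypothetical names chosen for illustration. A generic LRU eviction rule is used here in place of the tailored caching policies the abstract mentions, whose details are not given in this text.

```python
import math
from collections import OrderedDict
from typing import Callable, Dict, List, Tuple

Key = Tuple[int, int]      # assumed layout: (round, client_id)
Update = List[float]       # a flattened client update


class ToyFunctionCache:
    """Hypothetical cache node that stores metadata and executes requests on it."""

    def __init__(self, capacity: int, backing_store: Dict[Key, Update]) -> None:
        self.capacity = capacity
        self.backing_store = backing_store   # stand-in for a slow remote store
        self.cache: "OrderedDict[Key, Update]" = OrderedDict()
        self.hits = 0
        self.misses = 0

    def _get(self, key: Key) -> Update:
        if key in self.cache:
            self.hits += 1
            self.cache.move_to_end(key)      # mark as most recently used
            return self.cache[key]
        self.misses += 1
        value = self.backing_store[key]      # simulated remote fetch
        self.cache[key] = value
        if len(self.cache) > self.capacity:
            self.cache.popitem(last=False)   # evict least recently used (generic LRU)
        return value

    def run(self, workload: Callable[[List[Update]], float], keys: List[Key]) -> float:
        """Execute the workload where the data lives instead of shipping data out."""
        return workload([self._get(k) for k in keys])


def mean_l2_norm(updates: List[Update]) -> float:
    """Example non-training workload: average L2 norm of client updates."""
    norms = [math.sqrt(sum(w * w for w in u)) for u in updates]
    return sum(norms) / len(norms)


store = {(0, 0): [3.0, 4.0], (0, 1): [6.0, 8.0]}
node = ToyFunctionCache(capacity=2, backing_store=store)
first = node.run(mean_l2_norm, [(0, 0), (0, 1)])    # both keys fetched remotely
second = node.run(mean_l2_norm, [(0, 0), (0, 1)])   # both keys served from cache
```

In this toy setup the second request touches only cached entries, which is the kind of locality the abstract credits for the latency and cost reductions. A real policy would presumably choose what to keep based on which rounds and clients each workload type reads, which LRU does not capture.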