Just-in-Time Aggregation for Federated Learning
Abstract
The increasing number and scale of federated learning (FL) jobs necessitate resource-efficient scheduling and management of aggregation to make the economics of cloud-hosted aggregation work. Existing FL research has focused on the design of FL algorithms and optimization, and less on aggregation efficacy. In this paper, we propose a new FL aggregation paradigm, 'just-in-time' (JIT) aggregation, which leverages unique properties of FL jobs, especially the periodicity of model updates, to defer aggregation as long as possible and free compute resources for other FL jobs or other datacenter workloads. We describe a novel way to prioritize FL jobs for aggregation and demonstrate, using multiple datasets, models, and FL aggregation algorithms, that our techniques can reduce resource usage by 60% or more compared to the eager aggregation used in existing FL platforms. We also show that JIT aggregation adds negligible overhead and has negligible impact on the latency of the FL job.