Publication
ICDM 2017
Conference paper
GaDei: On scale-up training as a service for deep learning
Abstract
Deep learning (DL) training-as-a-service (TaaS) is an important emerging industrial workload. TaaS must satisfy a wide range of customers who lack the experience and/or resources to tune DL hyper-parameters (e.g., mini-batch size and learning rate), and meticulous tuning for each user's dataset is prohibitively expensive. Therefore, TaaS hyper-parameters must be fixed at values that are applicable to all users. Unfortunately, few research papers have studied how to design a system for TaaS workloads. By evaluating the IBM Watson Natural Language Classifier (NLC) workloads, the most popular IBM cognitive service, used by thousands of enterprise-level clients globally, we provide empirical evidence that only a conservative hyper-parameter setup (e.g., small mini-batch size) can guarantee acceptable model accuracy for a wide range of customers. However, a smaller mini-batch size requires higher communication bandwidth in a parameter-server-based DL training system. In this paper, we characterize the exceedingly high communication bandwidth requirement of TaaS using representative industrial deep learning workloads. We then present GaDei, a highly optimized shared-memory-based scale-up parameter server design. We evaluate GaDei using both commercial and public benchmarks and demonstrate that GaDei significantly outperforms the state-of-the-art parameter-server-based implementation while maintaining the required accuracy. GaDei achieves near-best-possible runtime performance, constrained only by hardware limitations. Furthermore, to the best of our knowledge, GaDei is the only scale-up DL system that provides fault tolerance.
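The link between mini-batch size and communication bandwidth can be illustrated with a back-of-the-envelope estimate. The sketch below is not taken from the paper: it assumes a simple parameter-server model in which every learner pushes one full gradient and pulls one full copy of the weights per mini-batch, and the function name, parameters, and example numbers are all hypothetical.

```python
def required_bandwidth_bytes_per_sec(model_bytes, samples_per_sec,
                                     mini_batch_size, num_learners):
    """Rough aggregate parameter-server traffic in bytes per second.

    Assumption (not from the paper): each learner sends one full gradient
    and receives one full set of weights for every mini-batch it processes.
    """
    if mini_batch_size <= 0:
        raise ValueError("mini_batch_size must be positive")
    # Mini-batches completed per second by a single learner.
    updates_per_sec = samples_per_sec / mini_batch_size
    # Factor 2: one gradient push plus one weight pull per update.
    return 2 * model_bytes * updates_per_sec * num_learners


if __name__ == "__main__":
    # Hypothetical numbers: a 100 MB model, 1000 samples/s per learner.
    for batch in (1, 4, 16, 128):
        bw = required_bandwidth_bytes_per_sec(100_000_000, 1000, batch, 4)
        print(f"mini-batch {batch:>3}: {bw / 1e9:.2f} GB/s")
```

Under these assumptions, traffic is inversely proportional to the mini-batch size, so a small mini-batch chosen for accuracy multiplies the bandwidth the parameter server must sustain.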