Deep learning (DL) training-as-a-service (TaaS) is an important emerging industrial workload. TaaS must satisfy a wide range of customers who have no experience and/or resources to tune DL hyper-parameters (e.g., mini-batch size and learning rate), and meticulous tuning for each user's dataset is prohibitively expensive. Therefore, TaaS hyper-parameters must be fixed with values that are applicable to all users. Unfortunately, few research papers have studied how to design a system for TaaS workloads. By evaluating the IBM Watson Natural Language Classfier (NLC) workloads, the most popular IBM cognitive service used by thousands of enterprise-level clients globally, we provide empirical evidence that only the conservative hyper-parameter setup (e.g., small mini-batch size) can guarantee acceptable model accuracy for a wide range of customers. Unfortunately, smaller mini-batch size requires higher communication bandwidth in a parameter-server based DL training system. In this paper, we characterize the exceedingly high communication bandwidth requirement of TaaS using representative industrial deep learning workloads. We then present GaDei, a highly optimized shared-memory based scale-up parameter server design. We evaluate GaDei using both commercial benchmarks and public benchmarks and demonstrate that GaDei significantly outperforms the state-of-the-art parameter-server based implementation while maintaining the required accuracy. GaDei achieves near-best-possible runtime performance, constrained only by the hardware limitation. Furthermore, to the best of our knowledge, GaDei is the only scale-up DL system that provides fault-tolerance.