Action recognition is an important problem in computer vision and has received substantial attention in recent years. However, it remains highly challenging due to the complex interplay of static and dynamic information, as well as the high computational cost of processing video data. This paper aims to transfer the success of static-image semantic recognition to the video domain by leveraging both static and motion-based descriptors at different stages of the semantic ladder. We examine the effects of three types of features: low-level dynamic descriptors, intermediate-level outputs of static deep architectures, and static high-level semantics. To combine these heterogeneous sources of information, we employ a scalable fusion method. Through extensive experimental evaluation, we demonstrate that the proposed framework significantly improves action classification performance. We obtain accuracies of 89.59% and 62.88% on the well-known UCF-101 and HMDB-51 benchmarks, respectively, which compare favorably with the state of the art.
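The abstract does not specify the fusion mechanism, so as an illustrative assumption only, one common way to combine heterogeneous feature channels is score-level (late) fusion, where per-channel classifier scores are averaged with channel weights before taking the arg-max. The sketch below uses hypothetical names and weights; it is not the paper's actual method:

```python
import numpy as np

def fuse_scores(channel_scores, weights=None):
    """Weighted late fusion of per-channel class-score matrices.

    channel_scores: list of (n_samples, n_classes) arrays, one per
        feature channel (e.g. low-level dynamic descriptors,
        intermediate-level deep features, high-level static semantics).
    weights: optional per-channel weights; defaults to uniform.
    Returns the predicted class index for each sample.
    """
    scores = [np.asarray(s, dtype=float) for s in channel_scores]
    if weights is None:
        weights = np.ones(len(scores))
    weights = np.asarray(weights, dtype=float)
    weights = weights / weights.sum()  # normalize so weights sum to 1
    fused = sum(w * s for w, s in zip(weights, scores))
    return fused.argmax(axis=1)

# Toy example: two channels, 2 samples, 3 action classes
dynamic = np.array([[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]])
static = np.array([[0.6, 0.3, 0.1], [0.2, 0.2, 0.6]])
pred = fuse_scores([dynamic, static], weights=[2, 1])  # favor dynamic channel
```

In this toy case the fused scores for the two samples peak at classes 0 and 1; the weighting lets stronger channels dominate without discarding weaker ones.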