"In recent years, a number of approaches based on 2D CNNs 2 and 3D CNNs have emerged for video action recogni3 tion, achieving state-of-the-art results on several large-scale 4 benchmark datasets. In this paper, we carry out in-depth com5 parative analysis to better understand the differences between 6 these approaches and the progress made by them. To this end, 7 we develop a unified framework for both 2D-CNN and 3D8 CNN action models, which enables us to remove bells and 9 whistles and provides a common ground for fair comparison. 10 We then conduct an effort towards a large-scale analysis in11 volving over 300 action recognition models. Our comprehen12 sive analysis reveals that a) a significant leap is made in effi13 ciency for action recognition, but not in accuracy; b) 2D-CNN 14 and 3D-CNN models behave similarly in terms of spatio15 temporal representation abilities and transferability. Our anal16 ysis also shows that recent action models seem to be able 17 to learn data-dependent temporality flexibly as needed. Our 18 codes and models will be publicly available."