Publication
CVPR 2021
Conference paper
Deep Analysis of CNN-based Spatio-temporal Representations for Action Recognition
Abstract
"In recent years, a number of approaches based on 2D CNNs 2 and 3D CNNs have emerged for video action recogni3 tion, achieving state-of-the-art results on several large-scale 4 benchmark datasets. In this paper, we carry out in-depth com5 parative analysis to better understand the differences between 6 these approaches and the progress made by them. To this end, 7 we develop a unified framework for both 2D-CNN and 3D8 CNN action models, which enables us to remove bells and 9 whistles and provides a common ground for fair comparison. 10 We then conduct an effort towards a large-scale analysis in11 volving over 300 action recognition models. Our comprehen12 sive analysis reveals that a) a significant leap is made in effi13 ciency for action recognition, but not in accuracy; b) 2D-CNN 14 and 3D-CNN models behave similarly in terms of spatio15 temporal representation abilities and transferability. Our anal16 ysis also shows that recent action models seem to be able 17 to learn data-dependent temporality flexibly as needed. Our 18 codes and models will be publicly available."