Publication
MM 2013
Conference paper
Learning latent spatio-temporal compositional model for human action recognition
Abstract
Action recognition is an important problem in multimedia understanding. This paper addresses this problem by building an expressive compositional action model. We model one action instance in the video with an ensemble of spatio-temporal compositions: a number of discrete temporal anchor frames, each of which is further decomposed into a layout of deformable parts. In this way, our model can identify a Spatio-Temporal And-Or Graph (STAOG) to represent the latent structure of actions, e.g. triple jumping, swinging and high jumping. The STAOG model comprises four layers: (i) a batch of leaf-nodes at the bottom for detecting various action parts within video patches; (ii) the or-nodes above the leaf-nodes, i.e. switch variables that activate their children leaf-nodes to account for structural variability; (iii) the and-nodes within an anchor frame for verifying spatial composition; and (iv) the root-node at the top for aggregating scores over temporal anchor frames. Moreover, contextual interactions are defined between leaf-nodes in both the spatial and temporal domains. For model training, we develop a novel weakly supervised learning algorithm which iteratively determines the structural configuration (e.g. the production of leaf-nodes associated with the or-nodes) along with the optimization of multi-layer parameters. By fully exploiting spatio-temporal compositions and interactions, our approach handles large intra-class action variance well (e.g. different views, individual appearances, spatio-temporal structures). Experimental results on challenging databases demonstrate superior performance of our approach over other methods. Copyright © 2013 ACM.
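The four-layer structure described in the abstract can be illustrated with a small sketch. It is not the paper's implementation. It assumes the conventional And-Or graph scoring rule, in which an or-node takes the maximum over its children leaf-nodes, an and-node sums its parts within one anchor frame, and the root-node sums over anchor frames. The contextual interaction terms and the learned parameters are omitted, and all names and the data layout (`frames[t][p][k]`) are hypothetical.

```python
from typing import List

# Hypothetical layout: frames[t][p][k] is the detection response of
# leaf-node k under or-node (part) p in temporal anchor frame t.
LeafScores = List[List[List[float]]]


def or_node_score(leaf_scores: List[float]) -> float:
    """Or-node: a switch variable that activates its best-scoring leaf-node."""
    return max(leaf_scores)


def and_node_score(parts: List[List[float]]) -> float:
    """And-node: combines all part (or-node) scores within one anchor frame."""
    return sum(or_node_score(p) for p in parts)


def root_score(frames: LeafScores) -> float:
    """Root-node: aggregates and-node scores over the temporal anchor frames."""
    return sum(and_node_score(f) for f in frames)


def active_leaves(frames: LeafScores) -> List[List[int]]:
    """Index of the leaf-node selected by each or-node (the latent structure)."""
    return [
        [max(range(len(p)), key=p.__getitem__) for p in f]
        for f in frames
    ]


# Two anchor frames, two parts per frame, two candidate leaf-nodes per part.
example: LeafScores = [
    [[0.25, 1.0], [0.5, 0.125]],
    [[0.25, 0.5], [0.75, 0.5]],
]

total = root_score(example)
structure = active_leaves(example)
```

Here `total` is the action score for one video and `structure` records which leaf-node each or-node switched on. In the paper this configuration is latent and is determined during weakly supervised training rather than given in advance.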