Publication
IJCAI 2020
Conference paper
Hierarchical attention based spatial-temporal graph-to-sequence learning for grounded video description
Abstract
The task of Grounded Video Description (GVD) is to generate sentences whose objects can be grounded to bounding boxes in the video frames. Existing works often fail to exploit structural information, both in modeling the relationships among region proposals and in attending to them during text generation. To address these issues, we cast the GVD task as a spatial-temporal Graph-to-Sequence learning problem, where we model video frames as a spatial-temporal sequence graph in order to better capture implicit structural relationships. In particular, we exploit two ways of constructing a sequence graph that captures spatial-temporal correlations among the different objects in each frame, and we further present a novel graph topology refinement technique to discover the optimal underlying graph structure. In addition, we present a hierarchical attention mechanism that attends to the sequence graph at different resolution levels to better generate the sentences. Our extensive experiments demonstrate the effectiveness of the proposed method compared to state-of-the-art methods.
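To make the hierarchical attention idea concrete, below is a minimal PyTorch sketch of one plausible two-level (object-level, then frame-level) attention over per-frame region-proposal graphs. It is an illustrative assumption, not the paper's implementation: the class name HierarchicalGraphAttention, the parameters feat_dim, query_dim, and hidden_dim, and the single neighbor-averaging propagation step over the adjacency matrix are all hypothetical choices made for the sketch.

import torch
import torch.nn as nn
import torch.nn.functional as F

class HierarchicalGraphAttention(nn.Module):
    # Hypothetical two-level additive attention: object-level within each
    # frame, then frame-level across time. Sketch only, not the paper's code.
    def __init__(self, feat_dim, query_dim, hidden_dim=256):
        super().__init__()
        self.obj_proj = nn.Linear(feat_dim + query_dim, hidden_dim)
        self.obj_score = nn.Linear(hidden_dim, 1)
        self.frame_proj = nn.Linear(feat_dim + query_dim, hidden_dim)
        self.frame_score = nn.Linear(hidden_dim, 1)

    def forward(self, node_feats, adj, query):
        # node_feats: (T, N, feat_dim) region-proposal features per frame
        # adj:        (T, N, N) per-frame adjacency (e.g., a refined soft graph)
        # query:      (query_dim,) decoder hidden state at the current step
        T, N, _ = node_feats.shape
        # One propagation step: each node averages its neighbors' features,
        # so attention scores reflect graph structure, not isolated regions.
        deg = adj.sum(dim=-1, keepdim=True).clamp(min=1.0)
        node_feats = torch.bmm(adj / deg, node_feats)

        # Object-level attention: weights over the N proposals in each frame.
        q_obj = query.expand(T, N, -1)
        obj_logits = self.obj_score(torch.tanh(
            self.obj_proj(torch.cat([node_feats, q_obj], dim=-1))))
        obj_attn = F.softmax(obj_logits, dim=1)             # (T, N, 1)
        frame_summary = (obj_attn * node_feats).sum(dim=1)  # (T, feat_dim)

        # Frame-level attention: weights over the T frame summaries.
        q_frame = query.expand(T, -1)
        frame_logits = self.frame_score(torch.tanh(
            self.frame_proj(torch.cat([frame_summary, q_frame], dim=-1))))
        frame_attn = F.softmax(frame_logits, dim=0)         # (T, 1)
        context = (frame_attn * frame_summary).sum(dim=0)   # (feat_dim,)
        return context

For example, attn = HierarchicalGraphAttention(feat_dim=1024, query_dim=512) followed by attn(torch.randn(8, 10, 1024), torch.rand(8, 10, 10), torch.randn(512)) yields one context vector per decoding step. The two softmax levels mirror the "different resolution levels" in the abstract: object-level weights select proposals within a frame, and frame-level weights select frames over time.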