Relation Attention for Temporal Action Localization
Temporal action localization aims to automatically localize and recognize all action instances in an untrimmed video. Most existing methods perform this task by first generating a set of proposals and then recognizing each one independently. However, due to the complex structures and large content variations of action instances, recognizing them individually can be difficult. Fortunately, proposals often share information about the same action. Such information, which is ignored by existing methods, can be used to boost recognition performance. In this paper, we propose a novel mechanism, called relation attention, to exploit informative relations among proposals based on their appearance or optical-flow features. Specifically, we propose a relation attention module that enhances the representation power of each proposal by capturing useful information from other proposals. This module preserves the input and output dimensions and does not rely on any specific proposal generation method or feature extraction backbone network. Experimental results show that the proposed relation attention mechanism significantly improves performance on both the Thumos14 and ActivityNet1.3 datasets compared to existing architectures. For example, built on Structured Segment Networks (SSN), the proposed relation attention module increases the mAP from 41.4 to 43.7 on the Thumos14 dataset, outperforming state-of-the-art results.
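The abstract describes a module that enhances each proposal's feature with information from other proposals while keeping input and output dimensions unchanged. As an illustration only, the minimal sketch below implements this idea as scaled dot-product attention with a residual connection over a set of proposal features; the projection matrices, dimensions, and the specific attention form are assumptions, not the paper's exact formulation.

```python
import numpy as np

def relation_attention(features, dim_k=64, seed=0):
    """Hypothetical sketch of a relation attention module: each
    proposal's feature is augmented with a weighted sum over all
    proposals' features, so related proposals exchange information.
    The output has the same shape as the input, as the abstract states."""
    rng = np.random.default_rng(seed)
    n, d = features.shape
    # Random projections stand in for learned query/key/value weights.
    w_q = rng.standard_normal((d, dim_k)) / np.sqrt(d)
    w_k = rng.standard_normal((d, dim_k)) / np.sqrt(d)
    w_v = rng.standard_normal((d, d)) / np.sqrt(d)
    q, k, v = features @ w_q, features @ w_k, features @ w_v
    scores = q @ k.T / np.sqrt(dim_k)  # pairwise proposal relations
    # Softmax over proposals (numerically stabilized).
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)
    # Residual connection keeps the original proposal information.
    return features + weights @ v

# Example: 5 proposals, each with a 128-dim appearance feature.
props = np.random.default_rng(1).standard_normal((5, 128))
out = relation_attention(props)
print(out.shape)  # dimensions are preserved: (5, 128)
```

Because the module is shape-preserving, it can be dropped between any proposal feature extractor and classifier without modifying either.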