Conference paper

Foundation Model and Temporal Priors-guided Transductive Few-shot Action Recognition

Abstract

Dynamic Time Warping (DTW) is a widely used metric for time series matching. However, when applied to few-shot action recognition (FSAR), DTW often suffers from the "identical matching" issue, in which multiple frames from one video are matched to a single frame from another. To mitigate this issue, we introduce FTP-FSAR, a novel metric-based FSAR approach. FTP-FSAR uses an alignment metric that incorporates temporal priors, which guide matching toward frames with similar temporal progression and thereby improve frame-matching accuracy. In addition, FTP-FSAR adopts a dual framework that combines a foundation model with transductive learning to optimize feature extraction. Extensive experiments on multiple datasets show that FTP-FSAR outperforms existing methods, achieving the best results on 3 of 4 benchmarks across the 1-shot, 3-shot, and 5-shot settings, with performance improvements of up to 4.5%.
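To make the "identical matching" issue concrete, the following minimal sketch runs classic DTW on two toy frame sequences and then repeats the alignment with a simple temporal penalty added to the frame-to-frame cost. It is an illustration only: the abstract does not specify the FTP-FSAR metric, so the penalty used here (the absolute difference in relative frame position, scaled by a hypothetical `weight`) and all function names are assumptions, not the authors' formulation.

```python
import math
from collections import Counter


def pairwise_cost(query, support):
    """Euclidean distance between every query frame and every support frame."""
    return [[math.dist(q, s) for s in support] for q in query]


def temporal_prior(n, m):
    """Assumed prior: gap between the relative positions of frame i and frame j."""
    def rel(k, length):
        return k / (length - 1) if length > 1 else 0.0
    return [[abs(rel(i, n) - rel(j, m)) for j in range(m)] for i in range(n)]


def add_prior(cost, weight):
    """Add the weighted temporal penalty to a frame-to-frame cost matrix."""
    n, m = len(cost), len(cost[0])
    prior = temporal_prior(n, m)
    return [[cost[i][j] + weight * prior[i][j] for j in range(m)] for i in range(n)]


def dtw(cost):
    """Classic DTW. Returns (total cost, warping path as 0-indexed (i, j) pairs)."""
    n, m = len(cost), len(cost[0])
    inf = float("inf")
    acc = [[inf] * (m + 1) for _ in range(n + 1)]
    acc[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            best_prev = min(acc[i - 1][j - 1], acc[i - 1][j], acc[i][j - 1])
            acc[i][j] = cost[i - 1][j - 1] + best_prev
    # Backtrack from the last cell to the first, following the cheapest predecessor.
    i, j = n, m
    path = [(i - 1, j - 1)]
    while (i, j) != (1, 1):
        _, i, j = min(
            (acc[i - 1][j - 1], i - 1, j - 1),
            (acc[i - 1][j], i - 1, j),
            (acc[i][j - 1], i, j - 1),
        )
        path.append((i - 1, j - 1))
    path.reverse()
    return acc[n][m], path


def max_fanout(path):
    """Largest number of frames in one video matched to a single frame in the other."""
    per_query = Counter(i for i, _ in path)
    per_support = Counter(j for _, j in path)
    return max(max(per_query.values()), max(per_support.values()))


# Toy 1-D "frame features": the query changes late, the support changes early.
query = [(0.0,), (0.0,), (0.0,), (1.0,)]
support = [(0.0,), (1.0,), (1.0,), (1.0,)]

cost = pairwise_cost(query, support)
_, plain_path = dtw(cost)
_, prior_path = dtw(add_prior(cost, weight=3.0))

print(max_fanout(plain_path))  # 3: three query frames collapse onto support frame 0
print(max_fanout(prior_path))  # 1: the penalty pushes the path back to the diagonal
```

In this toy case, plain DTW reaches zero cost by matching the first three query frames to the first support frame, which is the many-to-one behavior the abstract describes. Once the positional penalty is added, the one-to-one diagonal alignment becomes the cheapest path. The value `weight=3.0` is chosen only to make the effect visible on four frames.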
