What, When, and Where? Self-Supervised Spatio-Temporal Grounding in Untrimmed Multi-Action Videos from Narrated Instructions. Brian Chen, Nina Shvetsova, et al. CVPR 2024.
C2KD: Cross-Lingual Cross-Modal Knowledge Distillation for Multilingual Text-Video Retrieval. Andrew Rouditchenko, Yung-Sung Chuang, et al. ICASSP 2023.
Everything at Once - Multi-modal Fusion Transformer for Video Retrieval. Nina Shvetsova, Brian Chen, et al. CVPR 2022.