Robust detection of visual ROI for automatic speechreading
G. Iyengar, G. Potamianos, et al.
MMSP 2001
This paper describes progress towards a general framework for incorporating multimodal cues into a trainable system that automatically annotates user-defined semantic concepts in broadcast video. Models of arbitrary concepts are constructed by building classifiers in a score space defined by a pre-deployed set of multimodal models. Results show that annotation of user-defined concepts, both in and outside the pre-deployed set, is competitive with our best video-only models on the TREC Video 2002 corpus. An interesting side result is that speech-only models perform comparably to our best video-only models at detecting visual concepts such as "outdoors", "face", and "cityscape".
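The score-space construction described in the abstract can be sketched compactly. The following Python snippet is a minimal illustration, not the paper's implementation: the base-detector callables are hypothetical stand-ins for the pre-deployed multimodal models, and the SVM is an assumed choice of score-space classifier. Each shot is represented by the vector of confidence scores produced by the pre-deployed models, and a model for a new user-defined concept is simply a classifier trained on those vectors.

```python
import numpy as np
from sklearn.svm import SVC

def score_space(shots, base_detectors):
    """Represent each shot as its vector of base-model confidence scores.

    `base_detectors` is a hypothetical list of callables, each mapping a
    shot to a score for one pre-deployed concept (e.g. "face", "outdoors").
    """
    return np.array([[detect(shot) for detect in base_detectors]
                     for shot in shots])

def train_concept_model(train_shots, labels, base_detectors):
    """Fit a classifier for a user-defined concept in score space.

    The SVM is an assumption for illustration; the abstract does not fix
    the choice of score-space classifier.
    """
    X = score_space(train_shots, base_detectors)  # (n_shots, n_base_models)
    clf = SVC(probability=True)
    clf.fit(X, labels)
    return clf

def annotate(clf, shots, base_detectors):
    """Score unseen shots for the user-defined concept."""
    return clf.predict_proba(score_space(shots, base_detectors))[:, 1]
```

One consequence of this design, reflected in the abstract's claims, is that supporting a new user-defined concept requires only labeled examples, not new feature extractors, since the pre-deployed multimodal models supply the representation.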
G. Iyengar, H.J. Nock, et al.
ICME 2003
W. Hsu, L. Kennedy, et al.
ICASSP 2004