W. Hsu, L. Kennedy, et al.
ICASSP 2004
This paper describes progress towards a general framework for incorporating multimodal cues into a trainable system for automatically annotating user-defined semantic concepts in broadcast video. Models of arbitrary concepts are constructed by building classifiers in a score space defined by a pre-deployed set of multimodal models. Results show that annotation for user-defined concepts, both in and outside the pre-deployed set, is competitive with our best video-only models on the TREC Video 2002 corpus. An interesting side result is that speech-only models perform comparably to our best video-only models in detecting visual concepts such as "outdoors", "face", and "cityscape".
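The score-space idea can be sketched briefly: each shot is represented by the vector of confidence scores produced by the pre-deployed detectors, and a classifier for a new user-defined concept is trained on those vectors. The sketch below is a minimal illustration under assumptions not stated in the abstract: the pre-deployed detectors are modeled as callables returning a real-valued score, and an SVM is used as the score-space classifier.

```python
# Minimal sketch of score-space concept annotation.
# Assumptions (not from the paper): `base_detectors` is a list of callables,
# one per pre-deployed concept model (e.g. "outdoors", "face", "cityscape"),
# each mapping a shot's features to a confidence score; an RBF SVM is used
# as the classifier for the new user-defined concept.
import numpy as np
from sklearn.svm import SVC

def score_vector(shot_features, base_detectors):
    # Project one shot into the score space spanned by the pre-deployed models.
    return np.array([detector(shot_features) for detector in base_detectors])

def train_user_concept(shot_feature_list, labels, base_detectors):
    # Stack score vectors for the training shots and fit a classifier for
    # the new concept in that score space.
    X = np.vstack([score_vector(f, base_detectors) for f in shot_feature_list])
    clf = SVC(kernel="rbf", probability=True)
    clf.fit(X, labels)
    return clf
```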
G. Iyengar, P. Duygulu, et al.
MM 2005
G. Iyengar, A.B. Lippman
ICME 2000