Co-training non-robust classifiers for video semantic concept detection
Abstract
Semantic video characterization by automatic metadata tagging is increasingly popular. While some semantic concepts are unimodal, manifesting in either the image or the audio modality, a large number are multimodal, manifesting in both. Further, while some concepts such as Outdoors and Face occur frequently enough in training sets, a large number are rare, which makes them difficult to detect during automatic annotation. Semi-supervised learning algorithms such as co-training may help by incorporating a large amount of unlabeled data, holding the promise that redundant information across views can improve learning performance. Unfortunately, this promise has not been realized in multimedia content analysis, partly because models built from the labeled data alone are not robust enough, and their noisy classification of the unlabeled set compounds the problems faced by the co-training algorithm. In this paper we analyze whether a judicious application of co-training, in which some of the unlabeled samples are automatically labeled and re-inducted into the training set together with manual quality control, can improve detection performance. We report our findings on the TRECVID 2003 common annotation corpus.
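To make the procedure referenced above concrete, the following is a minimal sketch of a two-view co-training loop, not the authors' implementation: the use of scikit-learn's LogisticRegression, the confidence threshold, the per-round budget, and the helper name co_train are illustrative assumptions, and the manual quality-control step discussed in the paper is omitted.

import numpy as np
from sklearn.linear_model import LogisticRegression

def co_train(X1_l, X2_l, y_l, X1_u, X2_u, rounds=5, per_round=10, threshold=0.9):
    """Sketch of co-training over two feature views (e.g. image and audio).

    X1_l, X2_l, y_l : labeled samples in each view and their labels
    X1_u, X2_u      : unlabeled samples in each view
    Only pseudo-labels whose confidence exceeds `threshold` are re-inducted,
    mimicking a 'judicious' selection of unlabeled samples.
    """
    X1_l, X2_l, y_l = np.asarray(X1_l), np.asarray(X2_l), np.asarray(y_l)
    X1_u, X2_u = np.asarray(X1_u), np.asarray(X2_u)

    clf1 = LogisticRegression(max_iter=1000).fit(X1_l, y_l)
    clf2 = LogisticRegression(max_iter=1000).fit(X2_l, y_l)

    for _ in range(rounds):
        if len(X1_u) == 0:
            break
        # Each per-view classifier scores the unlabeled pool.
        p1 = clf1.predict_proba(X1_u).max(axis=1)
        p2 = clf2.predict_proba(X2_u).max(axis=1)
        conf = np.maximum(p1, p2)
        picked = np.argsort(-conf)[:per_round]
        picked = picked[conf[picked] >= threshold]
        if len(picked) == 0:
            break

        # Pseudo-label each picked sample with the more confident view.
        y1 = clf1.predict(X1_u[picked])
        y2 = clf2.predict(X2_u[picked])
        y_new = np.where(p1[picked] >= p2[picked], y1, y2)

        # Re-induct pseudo-labeled samples and shrink the unlabeled pool.
        X1_l = np.vstack([X1_l, X1_u[picked]])
        X2_l = np.vstack([X2_l, X2_u[picked]])
        y_l = np.concatenate([y_l, y_new])
        keep = np.setdiff1d(np.arange(len(X1_u)), picked)
        X1_u, X2_u = X1_u[keep], X2_u[keep]

        # Refit both view classifiers on the enlarged labeled set.
        clf1 = LogisticRegression(max_iter=1000).fit(X1_l, y_l)
        clf2 = LogisticRegression(max_iter=1000).fit(X2_l, y_l)

    return clf1, clf2

In this simplified variant the same high-confidence samples are added to both views' training sets; classical co-training instead lets each view nominate examples for the other, but the re-induction idea is the same.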