Web-based information content and its application to concept-based video retrieval

Alexander Haubold; Apostol Natsev

doi:10.1145/1386352.1386408

CIVR 2008

Conference paper

17 Dec 2008

Web-based information content and its application to concept-based video retrieval

View publication

Abstract

Semantic similarity between words or phrases is frequently used to find matching correlations between search queries and documents when straightforward matching of terms fails. This is particularly important for searching in visual databases, where pictures or video clips have been automatically tagged with a small set of semantic concepts based on analysis and classification of the visual content. Here, the textual description of documents is very limited, and semantic similarity based on WordNet's cognitive synonym structure, along with information content derived from term frequencies, can help to bridge the gap between an arbitrary textual query and a limited vocabulary of visual concepts. This approach, termed concept-based retrieval, has received significant attention over the last few years, and its success is highly dependent on the quality of the similarity measure used to map textual query terms to visual concepts. In this paper, we consider some issues of semantic similarity measures based on Information Content (IC), and propose a way to improve them. In particular, we note that most IC-based similarity measures are derived from a small and relatively outdated corpus (the Brown corpus), which does not adequately capture the usage pattern of many contemporary terms: for example, out of more than 150,000 WordNet terms, only about 36,000 are represented. This shortcoming reflects very negatively on the coverage of typical search query terms. We therefore suggest using alternative IC corpora that are larger and better aligned with the usage of modern vocabulary. We experimentally derive two such corpora using the WWW Google search engine, and show that they provide better coverage of vocabulary, while showing comparable frequencies for Brown corpus terms. Finally, we evaluate the two proposed IC corpora in the context of a concept-based video retrieval application using the TRECVID 2005, 2006, and 2007 datasets, and we show that they increase average precision results by up to 200%. Copyright 2008 ACM.

Conference paper