Identifying ambiguity in semantic resources
In many Information Extraction tasks, dictionaries and lexica are powerful building blocks for sophisticated extractions. The success of the Semantic Web in the last 10 years has produced an unprecedented quantity of available structured data that can be leveraged to produce dictionaries on countless concepts in many domains. While an invaluable resource, these automatically built dictionaries may contain "problematic" items, such as spurious words, which have been included by mistake, or ambiguous words, which appear with multiple different meanings in the target corpus and therefore necessitate an expensive disambiguation task. In this paper, we propose a simple and effective method to identify problematic terms in a given dictionary, that is, terms that are ambiguous or spurious with respect to a given corpus, with the aim of facilitating subsequent Information Extraction tasks. We demonstrate the effectiveness of the method with a systematic experiment on publicly available concept dictionaries, using a very large Web corpus as target, achieving an average precision above 85% in identifying problematic terms.