LexSemTM: A semantic dataset based on all-words unsupervised sense distribution learning

Andrew Bennett; Timothy Baldwin; Jey Han Lau; Diana McCarthy; Francis Bond

doi:10.18653/v1/p16-1143

ACL 2016

Conference paper

07 Aug 2016

Abstract

There has recently been a lot of interest in unsupervised methods for learning sense distributions, particularly in applications where sense distinctions are needed. This paper analyses a state-of-the-art method for sense distribution learning, and optimises it for application to the entire vocabulary of a given language. The optimised method is then used to produce LexSemTM: a sense frequency and semantic dataset of unprecedented size, spanning approximately 88% of polysemous, English simplex lemmas, which is released as a public resource to the community. Finally, the quality of this data is investigated, and the LexSemTM sense distributions are shown to be superior to those based on the WordNet first sense for lemmas missing from SemCor, and at least on par with SemCor-based distributions otherwise.
