LexSemTM: A semantic dataset based on all-words unsupervised sense distribution learning
There has recently been considerable interest in unsupervised methods for learning sense distributions, particularly in applications where sense distinctions are needed. This paper analyses a state-of-the-art method for sense distribution learning, and optimises it for application to the entire vocabulary of a given language. The optimised method is then used to produce LexSemTM: a sense frequency and semantic dataset of unprecedented size, spanning approximately 88% of polysemous English simplex lemmas, which is released as a public resource to the community. Finally, the quality of this data is investigated, and the LexSemTM sense distributions are shown to be superior to those based on the WordNet first sense for lemmas missing from SemCor, and at least on par with SemCor-based distributions otherwise.