Learning variable-length representation of words
A standard word embedding algorithm, such as ‘word2vec’, embeds each word as a dense vector of a preset dimensionality, the components of which are learned by maximizing the likelihood of predicting the context around it. However, as an inherent linguistic phenomenon, it is evident that there is a varying degree of difficulty in identifying words from their contexts. This suggests that a variable granularity in word vector representation may be useful to obtain sparser and more compressed word representations, requiring less storage space. To that end, in this paper, we propose a word vector training algorithm that uses a variable number of components to represent words. Given a text collection of documents, our algorithm, similar to the skip-gram approach of word2vec, learns to predict the context of a word given the current instance of a word. However, in contrast to skip-gram, which uses a static number of dimensions for each word vector, we propose to dynamically increase the dimensionality as a stochastic function of the prediction error. Our experiments with standard test collections demonstrate that our word representation method is able to achieve comparable (and sometimes even better) effectiveness than skip-gram word2vec, using a significantly smaller number of parameters (achieving compression ratio of around 65%).