Chandan Biswas, Debasis Ganguly, et al.
Pattern Recognition
A standard word embedding algorithm, such as ‘word2vec’, embeds each word as a dense vector of a preset dimensionality, the components of which are learned by maximizing the likelihood of predicting the context around it. However, as an inherent linguistic phenomenon, words vary in how difficult they are to identify from their contexts. This suggests that a variable granularity in word vector representation may be useful to obtain sparser and more compressed word representations, requiring less storage space. To that end, in this paper, we propose a word vector training algorithm that uses a variable number of components to represent words. Given a collection of text documents, our algorithm, similar to the skip-gram approach of word2vec, learns to predict the context surrounding each occurrence of a word. However, in contrast to skip-gram, which uses a fixed number of dimensions for each word vector, we propose to dynamically increase the dimensionality as a stochastic function of the prediction error. Our experiments with standard test collections demonstrate that our word representation method achieves effectiveness comparable to (and sometimes even better than) skip-gram word2vec, using a significantly smaller number of parameters (achieving a compression ratio of around 65%).
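The central idea of the abstract, growing a word's dimensionality stochastically in proportion to its prediction error during skip-gram-style training, can be sketched roughly as below. This is a hypothetical toy illustration in NumPy, not the paper's implementation; it omits negative sampling and the compressed storage scheme, and all names and constants (MAX_DIM, INIT_DIM, dims, the growth rule) are assumptions made for illustration.

```python
import numpy as np

# Toy sketch (assumed, not the authors' code): skip-gram-like training where each
# word's number of active vector components grows stochastically with its
# prediction error, so easy-to-predict words stay low-dimensional.

rng = np.random.default_rng(0)

corpus = "the cat sat on the mat the dog sat on the rug".split()
vocab = sorted(set(corpus))
w2i = {w: i for i, w in enumerate(vocab)}
V = len(vocab)

MAX_DIM, INIT_DIM = 16, 4        # assumed cap and starting size per word
lr, window, epochs = 0.05, 2, 50

# Per-word active dimensionality; components beyond dims[w] stay unused,
# which is what allows a sparser, more compressed representation.
dims = np.full(V, INIT_DIM)
W_in = rng.normal(0, 0.1, (V, MAX_DIM))    # target-word vectors
W_out = rng.normal(0, 0.1, (V, MAX_DIM))   # context-word vectors

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

for _ in range(epochs):
    for pos, word in enumerate(corpus):
        t = w2i[word]
        for off in range(-window, window + 1):
            c_pos = pos + off
            if off == 0 or c_pos < 0 or c_pos >= len(corpus):
                continue
            c = w2i[corpus[c_pos]]
            d = min(dims[t], dims[c])              # shared active components
            score = sigmoid(W_in[t, :d] @ W_out[c, :d])
            err = 1.0 - score                      # error on a true (word, context) pair
            # Gradient step on the active components only (negative sampling omitted).
            g_in = lr * err * W_out[c, :d]
            g_out = lr * err * W_in[t, :d]
            W_in[t, :d] += g_in
            W_out[c, :d] += g_out
            # Stochastic growth: the larger the prediction error, the more likely
            # this word gets one additional component (up to MAX_DIM).
            if dims[t] < MAX_DIM and rng.random() < err:
                dims[t] += 1

print({w: int(dims[w2i[w]]) for w in vocab})
```

On this toy corpus, frequent and easily predicted words tend to stop growing early, while harder words accumulate more components, which is the qualitative behaviour the abstract describes.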
Chandan Biswas, Dwaipayan Roy, et al.
CIKM 2019
Debasis Ganguly, Martin Gleize, et al.
AMIA Annual Symposium Proceedings
Suchana Datta, Debasis Ganguly, et al.
FIRE 2020