Conference paper

COIN – An Inexpensive and Strong Baseline for Predicting Out of Vocabulary Word Embeddings

Abstract

Predicting word embeddings for out-of-vocabulary (OOV) words remains an important challenge for NLP tools. Word embedding models only include terms that occur a sufficient number of times in their training corpora, so OOV vector prediction methods attempt to approximate this information for words missing from the vocabulary. We propose a fast method for predicting vectors for OOV terms that makes use of the terms surrounding the unknown term and the hidden context layer of the word2vec model. We propose this method as a strong baseline in the sense that 1) while it does not surpass all state-of-the-art methods, it outperforms several techniques for vector prediction on benchmark tasks, 2) even when it underperforms, the margin is small, so it retains competitive performance in downstream tasks, and 3) it is inexpensive to compute, requiring no additional training stage. We also show that our technique can be incorporated into existing methods to achieve a new state of the art on the word vector prediction problem.
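As a rough illustration of the idea described above, the sketch below predicts a vector for an OOV term by averaging the context-layer (output) vectors of the in-vocabulary words around it, using gensim's word2vec implementation. The function name and the plain averaging are assumptions made for illustration; the paper's exact aggregation may differ.

```python
import numpy as np
from gensim.models import Word2Vec

def predict_oov_vector(model, context_words):
    """Approximate a vector for an OOV term from its surrounding words.

    Assumed reading of the method: average the rows of word2vec's hidden
    context layer (model.syn1neg, the output matrix used with negative
    sampling) for the known words in the OOV term's context window.
    """
    known = [w for w in context_words if w in model.wv.key_to_index]
    if not known:
        # No usable context: fall back to a zero vector.
        return np.zeros(model.vector_size, dtype=np.float32)
    rows = model.syn1neg[[model.wv.key_to_index[w] for w in known]]
    return rows.mean(axis=0)

# Toy usage: train a tiny model, then predict a vector for an unseen word
# from the in-vocabulary words observed around it.
sentences = [["the", "quick", "brown", "fox"], ["the", "lazy", "dog"]]
model = Word2Vec(sentences, vector_size=50, min_count=1, sg=1, negative=5)
vec = predict_oov_vector(model, ["the", "quick", "fox"])
```

Because this reuses matrices the word2vec model already contains, it needs no additional training stage, which is consistent with the abstract's claim that the baseline is inexpensive to compute.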

Date

12 Oct 2022

Publication

COLING 2022
