Out-Of-Vocabulary (OOV) Word
An Out-Of-Vocabulary (OOV) Word is a Linguistic Unit or token that does not appear in the training vocabulary or training document.
- AKA: Out-Of-Vocabulary (OOV) Linguistic Unit, Out-Of-Vocabulary (OOV) Token.
- Context:
- It can be modeled by an Out-Of-Vocabulary (OOV) Word Modeling Task and detected by an OOV Word Recognizer.
- It can be generated by an Out-Of-Vocabulary (OOV) Word Generation System and translated by an Out-Of-Vocabulary (OOV) Word Translation System.
- It can range from being an Unknown Word to being a Rare Word.
- …
- Example(s):
- a word that has not been seen during training.
- a token that has not been seen during training.
- …
- Counter-Example(s):
- an Out-Of-Vocabulary (OOV) Morpheme,
- an Out-Of-Vocabulary (OOV) Named Entity.
- See: Word Embedding, Subword Unit, Vocabulary Word, Lexicon Word, Shorthand Word, Abbreviated Word, Lengthened Word, Unsupervised Transliteration Model, Lexical Normalization Task.
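The definition can be illustrated with a minimal sketch that flags the tokens of an input falling outside a fixed training vocabulary; the vocabulary and sentence below are hypothetical placeholders.

```python
# Minimal sketch: flagging OOV tokens against a fixed training vocabulary.
# The vocabulary and example sentence are illustrative placeholders.

training_vocabulary = {"the", "weather", "in", "boston", "is", "cold"}

def find_oov_tokens(tokens, vocabulary):
    """Return the tokens that do not appear in the training vocabulary."""
    return [token for token in tokens if token.lower() not in vocabulary]

sentence = "The weather in Reykjavik is cold".split()
print(find_oov_tokens(sentence, training_vocabulary))  # ['Reykjavik']
```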
References
2017a
- (Goldberg, 2017) ⇒ Yoav Goldberg. (2017). “Neural Network Methods for Natural Language Processing.” In: Synthesis Lectures on Human Language Technologies, 10(1). doi:10.2200/S00762ED1V01Y201703HLT037
- QUOTE: If we have the word form as a feature, why do we need the prefixes and suffixes? After all they are deterministic functions of the word. The reason is that if we encounter a word that we have not seen in training (out of vocabulary or OOV word) or a word we’ve seen only a handful of times in training (a rare word), we may not have robust enough information to base a decision on. In such cases, it is good to back-off to the prefixes and suffixes, which can provide useful hints. By including the prefix and suffix features also for words that are observed many times in training, we allow the learning algorithms to better adjust their weights, and hopefully use them properly when encountering OOV words.
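Goldberg's back-off argument can be sketched as a simple feature extractor: the word-form feature is replaced by an OOV marker when the word is unseen, while prefix and suffix features are always emitted. The feature names and toy vocabulary below are illustrative, not taken from the book.

```python
# Illustrative sketch (not Goldberg's implementation): word-form features with
# prefix/suffix back-off, so a model can rely on affixes for OOV or rare words.

def word_features(word, vocabulary, affix_len=3):
    """Return sparse feature names for one word; the names are hypothetical."""
    features = []
    if word.lower() in vocabulary:
        features.append(f"form={word.lower()}")
    else:
        features.append("form=<OOV>")
    # Affix features are emitted for every word, not only OOV ones, so their
    # weights are also learned from frequently observed words.
    features.append(f"prefix{affix_len}={word[:affix_len].lower()}")
    features.append(f"suffix{affix_len}={word[-affix_len:].lower()}")
    return features

vocab = {"walked", "talked", "the"}
print(word_features("strolled", vocab))
# ['form=<OOV>', 'prefix3=str', 'suffix3=led']
```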
2017b
- (Ruder, 2017) ⇒ Sebastian Ruder. (2017). “Word embeddings in 2017: Trends and future directions.” Blog post.
- QUOTE: One of the main problems of using pre-trained word embeddings is that they are unable to deal with out-of-vocabulary (OOV) words, i.e. words that have not been seen during training. Typically, such words are set to the UNK token and are assigned the same vector, which is an ineffective choice if the number of OOV words is large. Subword-level embeddings as discussed in the last section are one way to mitigate this issue. Another way, which is effective for reading comprehension (Dhingra et al., 2017) is to assign OOV words their pre-trained word embedding, if one is available.
Recently, different approaches have been proposed for generating embeddings for OOV words on-the-fly. Herbelot and Baroni (2017) initialize the embedding of OOV words as the sum of their context words and then rapidly refine only the OOV embedding with a high learning rate. Their approach is successful for a dataset that explicitly requires to model nonce words, but it is unclear if it can be scaled up to work reliably for more typical NLP tasks. Another interesting approach for generating OOV word embeddings is to train a character-based model to explicitly re-create pre-trained embeddings (Pinter et al., 2017). This is particularly useful in low-resource scenarios, where a large corpus is inaccessible and only pre-trained embeddings are available.
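A minimal sketch in the spirit of the strategies Ruder describes, using a shared UNK vector as the default and a context-sum initialization roughly following Herbelot and Baroni, is given below; the toy embedding table is hypothetical and this is not the authors' implementation.

```python
import numpy as np

# Sketch, assuming a tiny pre-trained embedding table: an OOV word either falls
# back to a shared UNK vector, or its embedding is initialized as the sum of
# the embeddings of its in-vocabulary context words.

rng = np.random.default_rng(0)
dim = 4
embeddings = {w: rng.normal(size=dim) for w in ["weather", "cold", "rainy"]}
unk_vector = np.zeros(dim)

def lookup(word):
    """UNK fallback: every OOV word shares the same fixed vector."""
    return embeddings.get(word, unk_vector)

def init_oov_from_context(context_words):
    """Initialize an OOV embedding as the sum of its in-vocabulary context words."""
    context_vecs = [embeddings[w] for w in context_words if w in embeddings]
    return np.sum(context_vecs, axis=0) if context_vecs else unk_vector

# 'drizzly' is OOV; give it a vector built from its observed context words.
embeddings["drizzly"] = init_oov_from_context(["rainy", "cold"])
print(lookup("drizzly"))
```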
2017c
- (Dhingra et al., 2017) ⇒ Bhuwan Dhingra, Hanxiao Liu, Ruslan Salakhutdinov, and William W. Cohen. (2017). “A Comparative Study of Word Embeddings for Reading Comprehension.” In: arXiv, abs/1703.00993.
- QUOTE: In this section we study some common techniques for dealing with OOV tokens at test time. Based on the results from the previous section, we conduct this study using only the off-the-shelf GloVe pre-trained embeddings.(...). Before training a neural network for RC, the developer must first decide on the set of words $V$ which will be assigned word vectors. Any token outside $V$ is treated as an OOV token (denoted by UNK) and is assigned the same fixed vector (...).
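The convention Dhingra et al. describe can be sketched as a token-to-id mapping in which every token outside $V$ collapses onto a single UNK id, and hence onto one fixed vector; the toy vocabulary below is hypothetical.

```python
# Minimal sketch of the UNK convention described above; V is illustrative.

UNK = "<unk>"
V = [UNK, "the", "capital", "of", "france", "is"]
word_to_id = {w: i for i, w in enumerate(V)}

def encode(tokens):
    """Map tokens to ids; all OOV tokens collapse onto the shared UNK id."""
    return [word_to_id.get(t.lower(), word_to_id[UNK]) for t in tokens]

print(encode("The capital of Burkina Faso is Ouagadougou".split()))
# [1, 2, 3, 0, 0, 5, 0] -- the OOV tokens all receive the UNK id 0
```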
2017d
- (Herbelot & Baroni, 2017) ⇒ Aurelie Herbelot, and Marco Baroni (2017). "High-risk learning: acquiring new word vectors from tiny data". In: Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing (EMNLP 2017).
2017e
- (Pinter et al., 2017) ⇒ Yuval Pinter, Robert Guthrie, and Jacob Eisenstein. (2017). “Mimicking Word Embeddings Using Subword RNNs.” In: Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing (EMNLP 2017).
2017f
- (See et al., 2017) ⇒ Abigail See, Peter J. Liu, and Christopher D. Manning. (2017). “Get To The Point: Summarization with Pointer-Generator Networks.” In: Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). DOI:10.18653/v1/P17-1099.
- QUOTE: We obtain the following probability distribution over the extended vocabulary: [math]\displaystyle{ P(w) = p_{gen}P_{vocab}(w) + (1 - p_{gen})\sum_{i:w_i=w} a^t_i }[/math] (9)
- Note that if $w$ is an out-of-vocabulary (OOV) word, then $P_{vocab}(w)$ is zero; similarly if $w$ does not appear in the source document, then $\sum_{i:w_i=w} a^t_i$ is zero. The ability to produce OOV words is one of the primary advantages of pointer-generator models; by contrast, models such as our baseline are restricted to their pre-set vocabulary.
- The loss function is as described in equations (6) and (7), but with respect to our modified probability distribution $P(w)$ given in equation (9).
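Equation (9) can be sketched numerically as follows; the vocabulary, attention weights, and $p_{gen}$ value are hypothetical, and this is not the authors' implementation.

```python
import numpy as np

# Illustrative sketch of equation (9): the final distribution over the extended
# vocabulary mixes the generator's vocabulary distribution with attention mass
# copied from the source tokens. Sizes and values are hypothetical.

vocab = ["<unk>", "the", "storm", "hit"]               # fixed generator vocabulary
source_tokens = ["the", "storm", "hit", "nantucket"]   # 'nantucket' is OOV
extended_vocab = vocab + ["nantucket"]

p_gen = 0.7
p_vocab = np.array([0.1, 0.3, 0.4, 0.2])               # P_vocab over `vocab`
attention = np.array([0.1, 0.2, 0.3, 0.4])             # a^t over source positions

p_final = np.zeros(len(extended_vocab))
p_final[:len(vocab)] = p_gen * p_vocab                 # P_vocab(w) is zero for OOV words
for i, token in enumerate(source_tokens):              # copy term via attention
    p_final[extended_vocab.index(token)] += (1 - p_gen) * attention[i]

print(dict(zip(extended_vocab, p_final.round(3))))
# 'nantucket' receives probability only from the copy term, as noted above.
```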
2014
- (Durrani et al., 2014) ⇒ Nadir Durrani, Hassan Sajjad, Hieu Hoang, and Philipp Koehn. (2014). “Integrating An Unsupervised Transliteration Model Into Statistical Machine Translation.” In: Proceedings of the 14th Conference of the European Chapter of the Association for Computational Linguistics (EACL 2014).
- QUOTE: All machine translation (MT) systems suffer from the existence of out-of-vocabulary (OOV) words, irrespective of the amount of data available for training. OOV words are mostly named entities, technical terms or foreign words that can be translated to the target language using transliteration.
2000
- (Bazzi & Glass, 2000) ⇒ Issam Bazzi, and James R. Glass. (2000). “Modeling Out-Of-Vocabulary Words For Robust Speech Recognition.” In: Sixth International Conference on Spoken Language Processing (ICSLP 2000) / (INTERSPEECH 2000).
- QUOTE: Out-of-vocabulary (OOV) words are a common occurrence in many speech recognition applications, and are a known source of recognition errors [2]. For example, in our JUPITER weather information domain the OOV rate is approximately 2%, and over 13% of the utterances contain OOV words [12].
There are three different problems which can be associated with OOV words. The first problem is that of detecting the presence of an OOV word(s). Given an utterance, we want to find out if it has any words that the recognizer does not have in its vocabulary. The second problem is the accurate recognition of the underlying sequence of sub-word units (e.g., phones) corresponding to the OOV word. The third problem is the sound-to-letter problem, which might involve converting the sub-word sequence into an actual word so that it may be understood semantically [9].
Most of the work in the literature addresses the first problem, that is the detection of OOV words. The most common approach is to incorporate some form of filler or garbage model which is used to absorb OOV words and non-speech artifacts. This approach has been effectively used in key-word spotting for example, where the recognizer vocabulary primarily contains key-words, so that the filler models are used extensively [11, 8]. In these applications, non key-words absorbed by the filler model are of little subsequent interest. Our work differs from these applications in that we are very interested with accurately recovering the underlying sub-word sequence of an OOV word for the purpose of ultimately recognizing the word. Although in this paper we start with a simple phone-based model, and do not evaluate its accuracy, we are ultimately interested in increasing the complexity of the OOV model by incorporating additional sub-word structure, so that we can accurately recognize OOV words while not degrading the performance of the word-based recognizer.