Distributional-based Subword Embedding Space
A Distributional-based Subword Embedding Space is a text-item embedding space for subwords that is associated with a distributional subword embedding function (which maps subwords to distributional subword vectors).
- Context:
- It can be created by a Distributional Subword Embedding Modeling System (that implements a distributional subword embedding modeling algorithm).
- It can range from being a Closed Distributional Subword Embedding Space (that applies only to the subwords observed in the training data) to being an Open Distributional Subword Embedding Space (that can compose vectors for words and subwords unseen in training, as in the toy sketch below the See list).
- It can be referenced by a Subword-level NLP Algorithm (such as subword-level seq2seq).
- …
- Example(s):
- one created by a fastText-based System (Bojanowski et al., 2017), which learns character n-gram vectors and composes word vectors from them.
- …
- Counter-Example(s):
- a Word Embedding Space, such as one created by a GloVe-based System (GloVe), which assigns vectors to whole words only.
- a Character Embedding Space.
- See: Character Embeddings, Word Embeddings.
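To make the definition concrete, the following toy Python sketch (illustrative only, not drawn from the cited sources; all names and values are hypothetical) represents a distributional subword embedding space as a lookup table from character n-grams to vectors and composes a word vector as the sum of its subword vectors. Because an unseen word can share n-grams with the training vocabulary, it also illustrates the "open" behavior mentioned in the Context section.

```python
# Illustrative sketch only: a toy "distributional subword embedding space"
# as a lookup table from character n-grams to vectors, with a word vector
# composed as the sum of its subword vectors (fastText-style composition).
import numpy as np

def char_ngrams(word, n_min=3, n_max=6):
    """Return the character n-grams of a word, with boundary markers."""
    marked = f"<{word}>"
    return [marked[i:i + n]
            for n in range(n_min, n_max + 1)
            for i in range(len(marked) - n + 1)]

# Hypothetical trained subword space: n-gram -> distributional vector.
rng = np.random.default_rng(0)
subword_space = {g: rng.normal(size=50)
                 for g in char_ngrams("deltaproteobacteria")}

def embed(word, space):
    """Sum the vectors of the word's n-grams that exist in the space."""
    vectors = [space[g] for g in char_ngrams(word) if g in space]
    return np.sum(vectors, axis=0) if vectors else None

# An unseen word can still be embedded if it shares n-grams with the
# training vocabulary -- the "open" behavior of the space.
print(embed("proteobacteria", subword_space).shape)  # (50,)
```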
References
2019
- (Zhang et al., 2019) ⇒ Yijia Zhang, Qingyu Chen, Zhihao Yang, Hongfei Lin, and Zhiyong Lu. (2019). “BioWordVec, Improving Biomedical Word Embeddings with Subword Information and MeSH.” In: Scientific Data, 6(1).
- QUOTE: … Subsequently, we use the subword embedding model to learn the text sequences and MeSH term sequences in a unified n-gram embedding space. Our word embeddings are assessed for both validity and utility on multiple BioNLP tasks …
Bojanowski et al., 2017 proposed fastText: a subword embedding model based on the skip-gram model that learns the character n-grams distributed embeddings using unlabeled corpora where each word is represented as the sum of the vector representations of its n-grams. Compared to the word2vec model, the subword embedding model can make effective use of the subword information and internal word structure to improve the embedding quality. In the biomedical domain, many specialized compound words, such as “deltaproteobacteria”, are rare or OOV in the training corpora, thus making them difficult to learn properly using the word2vec model. In contrast, the subword embedding model is naturally more suitable to deal with such situations. For instance, since “delta”, “proteo” and “bacteria” are common in the training corpora, the subword embedding model can learn the distributed representations of all character n-grams of “deltaproteobacteria”, and subsequently integrate the subword vectors to create the final embedding of “deltaproteobacteria”. In this study, we apply the subword embedding model to learn word embeddings from the joint text sequences of PubMed and MeSH. …
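As a hedged usage sketch of the mechanism described in the quote above (assuming the gensim library's FastText implementation; the corpus, corpus size, and hyperparameters below are illustrative placeholders, not those of the cited study), the following shows how a subword embedding model can assign a vector even to a word absent from its training corpus:

```python
# Minimal sketch: composing an out-of-vocabulary word vector from the
# vectors of its character n-grams, using gensim's FastText implementation.
from gensim.models import FastText

corpus = [
    ["deltaproteobacteria", "are", "gram-negative", "bacteria"],
    ["proteobacteria", "include", "many", "pathogens"],
]

# min_n / max_n set the character n-gram range that defines the subword
# embedding space; each word vector is built from these n-gram vectors.
model = FastText(vector_size=50, window=3, min_count=1, min_n=3, max_n=6)
model.build_vocab(corpus_iterable=corpus)
model.train(corpus_iterable=corpus, total_examples=len(corpus), epochs=10)

# A word missing from the training corpus still gets a vector, because
# its character n-grams (e.g. "proteo", "bacter") are in the space.
oov_vector = model.wv["gammaproteobacteria"]
print(oov_vector.shape)  # (50,)
```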
2017
- (Bojanowski et al., 2017) ⇒ Piotr Bojanowski, Edouard Grave, Armand Joulin, and Tomáš Mikolov. (2017). “Enriching Word Vectors with Subword Information.” In: Transactions of the Association for Computational Linguistics, 5.