GenSim System
A GenSim System is a Word Embedding System that builds semantic vectors from plain text documents by examining statistical co-occurrence patterns within a training corpus.
- Context:
- GitHub repository: https://github.com/RaRe-Technologies/gensim
- PyPI release (version 0.13.1): https://pypi.org/project/gensim/0.13.1/
- It was first introduced by Rehurek & Sojka (2010).
- …
- Example(s):
- Counter-Example(s):
- BERT System (Devlin et al., 2019),
- DISSECT System (Dinu et al., 2013),
- ELMo System (Peters et al., 2018),
- fastText System (Bojanowski et al., 2017),
- Flair Word Embedding System (Akbik et al., 2018),
- GloVe System (Pennington et al., 2014),
- Indra System (Sales et al., 2018),
- JoBimText System (Biemann & Riedl, 2013),
- MIMICK System (Pinter et al., 2017),
- MorphoRNN Embedding System (Luong et al., 2013),
- Polyglot System (Al-Rfou et al., 2013),
- SENNA System (Collobert & Weston, 2008),
- S-Space Word Embedding System (Jurgens & Stevens, 2010),
- SumEmbed System (Botha & Blunsom, 2014),
- VarEmbed System (Bhatia et al., 2016),
- Word2Vec System (Mikolov et al., 2014).
- See: One-Hot Encoding System, DeepLearning4J, Word Similarity Task, Word Analogy Task, Distributional Co-Occurrence Word Vector, Character Embedding System, Graph Embedding System, Subword Embedding System.
References
2021
- (Gensim, 2021) ⇒ https://radimrehurek.com/gensim/intro.html#what-is-gensim
- QUOTE: Gensim is a free open-source Python library for representing documents as semantic vectors, as efficiently (computer-wise) and painlessly (human-wise) as possible.
Gensim is designed to process raw, unstructured digital texts (“plain text”) using unsupervised machine learning algorithms.
The algorithms in Gensim, such as Word2Vec, FastText, Latent Semantic Indexing (LSI, LSA, LsiModel), Latent Dirichlet Allocation (LDA, LdaModel) etc, automatically discover the semantic structure of documents by examining statistical co-occurrence patterns within a corpus of training documents. These algorithms are unsupervised, which means no human input is necessary – you only need a corpus of plain text documents.
Once these statistical patterns are found, any plain text documents (sentence, phrase, word ...) can be succinctly expressed in the new, semantic representation and queried for topical similarity against other documents (words, phrases...).
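The idea in the quote above (derive vectors from co-occurrence statistics in a training corpus, then query new text for similarity) can be illustrated with a minimal standard-library sketch. This is not Gensim's implementation and does not use its API; Gensim's models (Word2Vec, FastText, LSI, LDA) rely on considerably more sophisticated algorithms. The toy corpus, the window size, and the helper names `cooccurrence_vectors`, `document_vector`, and `cosine` are all hypothetical and chosen only for illustration.

```python
from collections import Counter, defaultdict
from math import sqrt

# Toy training corpus (hypothetical data, for illustration only).
corpus = [
    "cats chase mice",
    "dogs chase cats",
    "mice eat cheese",
    "dogs eat meat",
]


def cooccurrence_vectors(documents, window=2):
    """Map each word to a Counter of the words seen within `window` positions of it."""
    vectors = defaultdict(Counter)
    for doc in documents:
        tokens = doc.lower().split()
        for i, word in enumerate(tokens):
            lo = max(0, i - window)
            hi = min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    vectors[word][tokens[j]] += 1
    return vectors


def document_vector(text, word_vectors):
    """Represent a text as the sum of the vectors of its known words."""
    total = Counter()
    for token in text.lower().split():
        total.update(word_vectors.get(token, Counter()))
    return total


def cosine(u, v):
    """Cosine similarity between two sparse count vectors (0.0 if either is empty)."""
    dot = sum(u[k] * v[k] for k in u if k in v)
    norm_u = sqrt(sum(x * x for x in u.values()))
    norm_v = sqrt(sum(x * x for x in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# "Training": no labels are needed, only the plain-text corpus.
word_vectors = cooccurrence_vectors(corpus)

# "Querying": compare words, or whole phrases, in the learned vector space.
print(cosine(word_vectors["dogs"], word_vectors["mice"]))
query = document_vector("cats eat", word_vectors)
for doc in corpus:
    print(doc, round(cosine(query, document_vector(doc, word_vectors)), 3))
```

Raw co-occurrence counts like these are sparse and sensitive to word frequency; dense, lower-dimensional representations such as those produced by the algorithms named in the quote are the usual remedy.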
2010
- (Rehurek & Sojka, 2010) ⇒ Radim Rehurek, and Petr Sojka. (2010). “Software Framework for Topic Modelling with Large Corpora.” In: Proceedings of the LREC 2010 workshop on new challenges for NLP frameworks.