Sentence Embedding System

From GM-RKB
(Redirected from sentence encoder)
Jump to navigation Jump to search

A Sentence Embedding System is a text-item embedding encoder that accepts sentence items and produces sentence embeddings to define a sentence embedding space.



References

2024-12-29

[1] https://swimm.io/learn/large-language-models/5-types-of-word-embeddings-and-example-nlp-applications
[2] https://airbyte.com/data-engineering-resources/sentence-word-embeddings
[3] https://en.wikipedia.org/wiki/Sentence_embedding
[4] https://codesphere.com/articles/best-open-source-sentence-embedding-models
[5] https://spotintelligence.com/2022/12/17/sentence-embedding/
[6] https://stackoverflow.com/questions/59877385/what-is-the-difference-between-sentence-encodings-and-contextualized-word-embedd
[7] https://stackoverflow.blog/2023/11/09/an-intuitive-introduction-to-text-embeddings/
[8] https://osanseviero.github.io/hackerllama/blog/posts/sentence_embeddings/
[9] https://incubity.ambilio.com/sentence-embedding-vs-word-embedding-in-rag-model/

2024

2024

  • (Wikipedia, 2024) ⇒ https://en.wikipedia.org/wiki/Sentence_embedding Retrieved:2024-2-10.
    • In natural language processing, a sentence embedding refers to a numeric representation of a sentence in the form of a vector of real numbers which encodes meaningful semantic information. [1] [2] [3] State of the art embeddings are based on the learned hidden layer representation of dedicated sentence transformer models. BERT pioneered an approach involving the use of a dedicated [CLS] token prepended to the beginning of each sentence inputted into the model; the final hidden state vector of this token encodes information about the sentence and can be fine-tuned for use in sentence classification tasks. In practice however, BERT's sentence embedding with the [CLS] token achieves poor performance, often worse than simply averaging non-contextual word embeddings. SBERT later achieved superior sentence embedding performance by fine tuning BERT's [CLS] token embeddings through the usage of a siamese neural network architecture on the SNLI dataset. Other approaches are loosely based on the idea of distributional semantics applied to sentences. Skip-Thought trains an encoder-decoder structure for the task of neighboring sentences predictions. Though this has been shown to achieve worse performance than approaches such as InferSent or SBERT. An alternative direction is to aggregate word embeddings, such as those returned by Word2vec, into sentence embeddings. The most straightforward approach is to simply compute the average of word vectors, known as continuous bag-of-words (CBOW). However, more elaborate solutions based on word vector quantization have also been proposed. One such approach is the vector of locally aggregated word embeddings (VLAWE), which demonstrated performance improvements in downstream text classification tasks.

2018

  • (Wolf, 2018b) ⇒ Thomas Wolf. (2018). “The Current Best of Universal Word Embeddings and Sentence Embeddings." Blog post
    • QUOTE: Word and sentence embeddings have become an essential part of any Deep-Learning-based natural language processing systems. They encode words and sentences 📜 in fixed-length dense vectors 📐 to drastically improve the processing of textual data. A huge trend is the quest for Universal Embeddings: embeddings that are pre-trained on a large corpus and can be plugged in a variety of downstream task models (sentimental analysis, classification, translation…) to automatically improve their performance by incorporating some general word/sentence representations learned on the larger dataset. It’s a form of transfer learning. Transfer learning has been recently shown to drastically increase the performance of NLP models on important tasks such as text classification. …

      … There are currently many competing schemes for learning sentence embeddings. While simple baselines like averaging word embeddings consistently give strong results, a few novel unsupervised and supervised approaches, as well as multi-task learning schemes, have emerged in late 2017-early 2018 and lead to interesting improvements. Let’s go quickly through the four types of approaches currently studied: from simple word vector averaging baselines to unsupervised/supervised approaches and multi-task learning schemes (as illustrated above). There is a general consensus in the field that the simple approach of directly averaging a sentence’s word vectors (so-called Bag-of-Word approach) gives a strong baseline for many downstream tasks. A good algorithm for computing such a baseline is detailed in the work of Arora et al. published last year at ICLR, A Simple but Tough-to-Beat Baseline for Sentence Embeddings: use a popular word embeddings of your choice, encode a sentence in a linear weighted combination the word vectors and perform a common component removal (remove the projection of the vectors on their first principal component). This general method has deeper and powerful theoretical motivations that rely on a generative model which uses a random walk on a discourse vector to generate text

2017

  • (Nikhil, 2017) ⇒ Nishant Nikhil. (2017). “Sentence Embedding."
    • QUOTE: … One way to get a representation of sentences is to add all the representation of word vectors contained in it, it is termed as words centroid. And similarity between two sentences can be computed by centroid distance. Same thing can be extended to paragraphs and documents. But this method neglects a lot of information like the sequence and it might give false results. Like:
      • You are going there to teach not play.
      • You are going there to play not teach.

2015