Semantic Similarity Measure
A Semantic Similarity Measure is a similarity measure that approximates a semantic relationship between two (or more) meaning carriers.
- Context:
- range: a Semantic Similarity Score.
- It does not take into account antonymy or meronymy.
- It can be modeled by a Semantic Similarity Modelling Task.
- It can be evaluated by a Semantic Similarity Benchmark Task.
- It can range from being a Semantic Subword Similarity Measure to being a Semantic Word Similarity Measure.
- It can range from being a Semantic Sentence Similarity Measure to being a Semantic Textual Similarity (STS) Measure.
- It can range from being a Symmetric Semantic Similarity Measure to being a Non-Symmetric Semantic Similarity Measure.
- It can range from being a Feature-based Semantic Similarity Measure to being a Graph-based Semantic Similarity Measure.
- It can range from being a Topological Semantic Similarity Measure to being a Statistical Semantic Similarity Measure.
- It can range from being an Edge-based Semantic Similarity Measure to being a Node-based Semantic Similarity Measure.
- It can range from being a Pairwise Semantic Similarity Measure to being a Groupwise Semantic Similarity Measure.
- It can range from being an Extensional-based Semantic Similarity Measure to being an Intensional-based Semantic Similarity Measure.
- It can range from being a Corpus-based Semantic Similarity Measure to being a Knowledge-based Semantic Similarity Measure.
- It can range from being a Taxonomy-based Semantic Similarity Measure to being an Ontology-based Semantic Similarity Measure.
- Example(s):
- Counter-Example(s):
- See: Similarity Matrix, Clustering Task, Local Search, Semantic Similarity Neural Network, Semantic Graph Database.
References
2021a
- (Chandrasekaran & Mago, 2021) ⇒ Dhivya Chandrasekaran, and Vijay Mago. (2021). “Evolution of Semantic Similarity - A Survey.” In: ACM Computing Surveys, 54(2).
- QUOTE: Semantic similarity methods usually give a ranking or percentage of similarity between texts, rather than a binary decision as similar or not similar. Semantic similarity is often used synonymously with semantic relatedness. However, semantic relatedness not only accounts for the semantic similarity between texts but also considers a broader perspective analyzing the shared semantic properties of two words. For example, the words ‘coffee’ and ‘mug’ may be related to one another closely, but they are not considered semantically similar, whereas the words ‘coffee’ and ‘tea’ are semantically similar. Thus, semantic similarity may be considered as one of the aspects of semantic relatedness. The semantic relationship including similarity is measured in terms of semantic distance, which is inversely proportional to the relationship (...)
2021b
- (Wikipedia, 2021) ⇒ https://en.wikipedia.org/wiki/Semantic_similarity Retrieved:2021-5-29.
- Semantic similarity is a metric defined over a set of documents or terms, where the idea of distance between items is based on the likeness of their meaning or semantic content as opposed to lexicographical similarity. These are mathematical tools used to estimate the strength of the semantic relationship between units of language, concepts or instances, through a numerical description obtained according to the comparison of information supporting their meaning or describing their nature.[1] [2] The term semantic similarity is often confused with semantic relatedness. Semantic relatedness includes any relation between two terms, while semantic similarity only includes "is a" relations.
For example, "car" is similar to "bus", but is also related to "road" and "driving".
Computationally, semantic similarity can be estimated by defining a topological similarity, by using ontologies to define the distance between terms/concepts. For example, a naive metric for the comparison of concepts ordered in a partially ordered set and represented as nodes of a directed acyclic graph (e.g., a taxonomy), would be the shortest-path linking the two concept nodes. Based on text analyses, semantic relatedness between units of language (e.g., words, sentences) can also be estimated using statistical means such as a vector space model to correlate words and textual contexts from a suitable text corpus. Proposed semantic similarity / relatedness measures are evaluated in two main ways. The first is based on the use of datasets designed by experts and composed of word pairs with semantic similarity / relatedness degree estimations. The second is based on the integration of the measures inside specific applications such as information retrieval, recommender systems, and natural language processing.
- ↑ Harispe S.; Ranwez S. Janaqi S.; Montmain J. (2015). “Semantic Similarity from Natural Language and Ontology Analysis". Synthesis Lectures on Human Language Technologies. 8:1: 1–254.
- ↑ Feng Y.; Bagheri E.; Ensan F.; Jovanovic J. (2017). “The state of the art in semantic relatedness: a framework for comparison". Knowledge Engineering Review. 32: 1–30. doi:10.1017/S0269888917000029.
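The naive shortest-path metric described in the Wikipedia excerpt above can be sketched as a breadth-first search over a taxonomy graph. The toy taxonomy and concept names below are illustrative assumptions, not drawn from any standard resource:

```python
from collections import deque

# Hypothetical toy taxonomy: each concept maps to its parent concepts ("is a" edges).
TAXONOMY = {
    "car": ["vehicle"],
    "bus": ["vehicle"],
    "bicycle": ["vehicle"],
    "vehicle": ["artifact"],
    "artifact": [],
}

def shortest_path_distance(a, b, taxonomy):
    """Breadth-first search over the undirected version of the taxonomy graph,
    returning the number of edges on the shortest path between two concepts."""
    # Build an undirected adjacency list so the search can move both up and
    # down the hierarchy (child -> parent and parent -> child).
    adjacency = {node: set(parents) for node, parents in taxonomy.items()}
    for node, parents in taxonomy.items():
        for parent in parents:
            adjacency.setdefault(parent, set()).add(node)
    frontier = deque([(a, 0)])
    seen = {a}
    while frontier:
        node, dist = frontier.popleft()
        if node == b:
            return dist
        for neighbour in adjacency.get(node, ()):
            if neighbour not in seen:
                seen.add(neighbour)
                frontier.append((neighbour, dist + 1))
    return None  # no path: the concepts sit in disconnected taxonomies

print(shortest_path_distance("car", "bus", TAXONOMY))  # 2 (car -> vehicle -> bus)
```

A smaller path distance corresponds to a higher semantic similarity; practical edge-based measures normalise this distance by taxonomy depth.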
2017a
- (SemEval, 2017) ⇒ SemEval-2017 Task 2: https://alt.qcri.org/semeval2017/task2/
- QUOTE: Semantic similarity is a core field of Natural Language Processing (NLP) which deals with measuring the extent to which two linguistic items are similar. In particular, the word semantic similarity framework is widely accepted as the most direct in-vitro evaluation of semantic vector space models (e.g., word embeddings) and in general semantic representation techniques. As a result, word similarity datasets play a major role in the advancement of research in lexical semantics. Given the importance of moving beyond the barriers of English language by developing language-independent techniques, the SemEval-2017 Task 2 provides a reliable framework for evaluating both monolingual and multilingual semantic representations, and similarity techniques.
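In the vector space models the SemEval task refers to, the similarity of two words is typically scored with the cosine of the angle between their embedding vectors. A minimal sketch, using made-up 3-dimensional vectors purely for illustration (real embeddings have hundreds of dimensions):

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two embedding vectors: 1.0 for parallel
    vectors (maximal similarity), 0.0 for orthogonal ones."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(x * x for x in v))
    return dot / (norm_u * norm_v)

# Hypothetical embeddings chosen so that 'coffee' and 'tea' point in
# similar directions while 'mug' does not.
coffee = [0.9, 0.1, 0.3]
tea    = [0.8, 0.2, 0.4]
mug    = [0.1, 0.9, 0.2]

print(cosine_similarity(coffee, tea) > cosine_similarity(coffee, mug))  # True
```

Word similarity benchmarks then correlate such model scores against human similarity judgements over fixed word pairs.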
2017b
- (Harispe et al., 2017) ⇒ Sebastien Harispe, Sylvie Ranwez, Stefan Janaqi, and Jacky Montmain. (2017). “Semantic Similarity from Natural Language and Ontology Analysis.” In: CoRR, abs/1704.05295.
2015
- (Wikipedia, 2015) ⇒ http://en.wikipedia.org/wiki/Semantic_similarity#Statistical_similarity Retrieved:2014-12-10.
- Statistical Similarity.
- LSA (Latent semantic analysis) (+) vector-based, adds vectors to measure multi-word terms; (−) non-incremental vocabulary, long pre-processing times
- PMI (Pointwise mutual information) (+) large vocab, because it uses any search engine (like Google); (−) cannot measure relatedness between whole sentences or documents
- SOC-PMI (Second-order co-occurrence pointwise mutual information) (+) sort lists of important neighbor words from a large corpus; (−) cannot measure relatedness between whole sentences or documents
- GLSA (Generalized Latent Semantic Analysis) (+) vector-based, adds vectors to measure multi-word terms; (−) non-incremental vocabulary, long pre-processing times
- ICAN (Incremental Construction of an Associative Network) (+) incremental, network-based measure, good for spreading activation, accounts for second-order relatedness; (−) cannot measure relatedness between multi-word terms, long pre-processing times
- NGD (Normalized Google distance) (+) large vocab, because it uses any search engine (like Google); (−) can measure relatedness between whole sentences or documents, but the larger the sentence or document, the more ingenuity is required (Cilibrasi & Vitanyi, 2007; see reference below).
- NCD (Normalized Compression Distance)
- ESA (Explicit Semantic Analysis) based on Wikipedia and the ODP.
- SSA (Salient Semantic Analysis) which indexes terms using salient concepts found in their immediate context.
- n° of Wikipedia (noW), inspired by the game Six Degrees of Wikipedia, is a distance metric based on the hierarchical structure of Wikipedia. A directed-acyclic graph is first constructed and later, Dijkstra's shortest path algorithm is employed to determine the noW value between two terms as the geodesic distance between the corresponding topics (i.e. nodes) in the graph.
- VGEM (Vector Generation of an Explicitly-defined Multidimensional Semantic Space) (+) incremental vocab, can compare multi-word terms (−) performance depends on choosing specific dimensions
- BLOSSOM (Best path Length On a Semantic Self-Organizing Map) (+) uses a Self Organizing Map to reduce high-dimensional spaces, can use different vector representations (VGEM or word-document matrix), provides 'concept path linking' from one word to another (−) highly experimental, requires nontrivial SOM calculation
- SimRank
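As a concrete instance of the PMI-style measures listed above, pointwise mutual information scores a word pair by how much more often the words co-occur than chance would predict. A toy estimator over a small whitespace-tokenised corpus (an illustrative assumption; real PMI systems use search-engine or web-scale counts):

```python
import math
from collections import Counter

def pmi(word_x, word_y, corpus, window=2):
    """Pointwise mutual information estimated from co-occurrence counts:
    PMI(x, y) = log2( p(x, y) / (p(x) * p(y)) )."""
    tokens = corpus.split()
    n = len(tokens)
    unigrams = Counter(tokens)
    # Count occurrences of word_x that have word_y within a fixed-size window.
    pair_count = sum(
        1
        for i, tok in enumerate(tokens)
        if tok == word_x and word_y in tokens[max(0, i - window): i + window + 1]
    )
    if pair_count == 0:
        return float("-inf")  # never co-occur: maximally unrelated
    p_x, p_y = unigrams[word_x] / n, unigrams[word_y] / n
    p_xy = pair_count / n
    return math.log2(p_xy / (p_x * p_y))

corpus = "coffee mug coffee tea tea cup coffee tea"
print(round(pmi("coffee", "tea", corpus), 2))  # 0.83
```

A positive PMI means the pair co-occurs more often than independence would predict; SOC-PMI builds on this by comparing the PMI-ranked neighbour lists of the two words.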
2009
- (Pesquita et al., 2009) ⇒ Catia Pesquita, Daniel Faria, Andre O. Falcao, Phillip Lord, and Francisco M. Couto. (2009). “Semantic Similarity in Biomedical Ontologies.” In: PLoS Computational Biology, 5(7): e1000443.
2008
- (Milne & Witten, 2008b) ⇒ David N. Milne, and Ian H. Witten. (2008). “An Effective, Low-cost Measure of Semantic Relatedness Obtained from Wikipedia Links.” In: Proceedings of the AAAI 2008 Workshop on Wikipedia and Artificial Intelligence (WIKIAI 2008).
2007
- (Cilibrasi & Vitanyi, 2007) ⇒ R. L. Cilibrasi and P. M. B. Vitanyi. (2007). “The Google Similarity Distance.” In: IEEE Transactions on Knowledge and Data Engineering 19(3). doi:10.1109/TKDE.2007.48
2006
- (Budanitsky et al., 2006) ⇒ Alexander Budanitsky, and Graeme Hirst. (2006). “Evaluating WordNet-based Measures of Lexical Semantic Relatedness.” In: Computational Linguistics Journal, 32(1). doi:10.1162/coli.2006.32.1.13
- QUOTE: The need to determine semantic relatedness or its inverse, semantic distance, between two lexically expressed concepts is a problem that pervades much of natural language processing.
1989
- (Church et al., 1989) ⇒ Kenneth W. Church, and P. Hanks. (1989). “Word Association Norms, Mutual Information and Lexicography.” In: Proceedings of the 27th Annual Meeting of the Association for Computational Linguistics (ACL 1989).