Semantic Word Similarity (SWS) Measure
A Semantic Word Similarity (SWS) Measure is a linguistic semantic similarity measure for lexical items.
- AKA: Lexical Semantic Similarity.
- Context:
- domain: Lexical Item Set, typically a lexical item pair.
- range: a Lexical Semantic Similarity Score.
- It can range from being a Corpus-based Lexical Semantic Similarity Measure to being a Knowledge Base-based Lexical Semantic Similarity Measure (such as a WordNet lexical similarity measure).
- It can be created by a Lexical Semantic Similarity Measure Creation Task.
- It can support Lexical Similarity Tasks (such as a Word Similarity Task, a List Similar Words Task, a Lexical Analogy Task, ...).
- It can (typically) involve a domain of Lexical Item Sets, usually a lexical item pair, and output a Lexical Semantic Similarity Score.
- ...
- Example(s):
- WordSim-353 (a benchmark dataset for evaluating such measures),
- WS-Sim (a similarity-focused subset of WordSim-353),
- Path Distance Similarity,
- Leacock Chodorow Similarity,
- Wu-Palmer Similarity,
- Resnik Similarity,
- Jiang-Conrath Similarity,
- Lin Similarity,
- word2vec Similarity.
- WikiRelate Similarity, by a WikiRelate System.
- WordNet-SenseRelate Similarity, by a WordNet-SenseRelate System.
- …
- LSSM("queen", "queens") ⇒ 0.00128
- LSSM("queen", "woman") ⇒ 0.00832
- LSSM("queen", "king") ⇒ 0.0247
- LSSM("Air Canada", "American Airways") ⇒ 0.0247
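Scores like the ones above can be produced, for example, by taking the cosine similarity of word embedding vectors (as in word2vec Similarity). A minimal sketch, using hypothetical toy vectors rather than learned embeddings (the vectors and words here are illustrative assumptions):

```python
from math import sqrt

# Toy 3-dimensional embeddings for illustration only; a real system
# would use vectors learned by a model such as word2vec.
vectors = {
    "queen": [0.9, 0.8, 0.1],
    "king":  [0.9, 0.7, 0.2],
    "woman": [0.6, 0.9, 0.1],
}

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors: a common similarity score."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sqrt(sum(a * a for a in u))
    norm_v = sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)
```

With these toy vectors, `cosine_similarity(vectors["queen"], vectors["king"])` ranks higher than `cosine_similarity(vectors["queen"], vectors["woman"])`, mirroring the ordering of the example scores.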
- ...
- Counter-Example(s):
- a Semantic Relatedness Measure, which also scores non-similarity relations (e.g., relating ‘coffee’ to ‘mug’).
- See: Word Sense Classification, Lexical Chain, Distributional Word Semantics Heuristic, Word Similarity Dataset.
References
2021a
- (Chandrasekaran & Mago, 2021) ⇒ Dhivya Chandrasekaran, and Vijay Mago. (2021). “Evolution of Semantic Similarity - A Survey.” In: ACM Computing Surveys, 54(2).
- QUOTE: Semantic similarity methods usually give a ranking or percentage of similarity between texts, rather than a binary decision as similar or not similar. Semantic similarity is often used synonymously with semantic relatedness. However, semantic relatedness not only accounts for the semantic similarity between texts but also considers a broader perspective analyzing the shared semantic properties of two words. For example, the words ‘coffee’ and ‘mug’ may be related to one another closely, but they are not considered semantically similar whereas the words ‘coffee’ and ‘tea’ are semantically similar. Thus, semantic similarity may be considered, as one of the aspects of semantic relatedness. The semantic relationship including similarity is measured in terms of semantic distance, which is inversely proportional to the relationship (...)
2021b
- (Wikipedia, 2021) ⇒ https://en.wikipedia.org/wiki/Semantic_similarity Retrieved:2021-5-29.
- Semantic similarity is a metric defined over a set of documents or terms, where the idea of distance between items is based on the likeness of their meaning or semantic content as opposed to lexicographical similarity. These are mathematical tools used to estimate the strength of the semantic relationship between units of language, concepts or instances, through a numerical description obtained according to the comparison of information supporting their meaning or describing their nature.[1] [2] The term semantic similarity is often confused with semantic relatedness. Semantic relatedness includes any relation between two terms, while semantic similarity only includes "is a" relations.
For example, "car" is similar to "bus", but is also related to "road" and "driving".
Computationally, semantic similarity can be estimated by defining a topological similarity, by using ontologies to define the distance between terms/concepts. For example, a naive metric for the comparison of concepts ordered in a partially ordered set and represented as nodes of a directed acyclic graph (e.g., a taxonomy), would be the shortest-path linking the two concept nodes. Based on text analyses, semantic relatedness between units of language (e.g., words, sentences) can also be estimated using statistical means such as a vector space model to correlate words and textual contexts from a suitable text corpus. The evaluation of the proposed semantic similarity / relatedness measures are evaluated through two main ways. The former is based on the use of datasets designed by experts and composed of word pairs with semantic similarity / relatedness degree estimation. The second way is based on the integration of the measures inside specific applications such the information retrieval, recommender systems, natural language processing, etc.
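The naive shortest-path metric described in the quote above can be sketched over a toy IS-A taxonomy; the concept names and links here are illustrative assumptions, not actual WordNet data:

```python
from collections import deque

# A tiny hypothetical IS-A taxonomy (child -> parent).
parents = {
    "car": "vehicle",
    "bus": "vehicle",
    "vehicle": "artifact",
    "road": "artifact",
}

def as_graph(taxonomy):
    """Treat the taxonomy as an undirected graph of IS-A links."""
    graph = {}
    for child, parent in taxonomy.items():
        graph.setdefault(child, set()).add(parent)
        graph.setdefault(parent, set()).add(child)
    return graph

def path_length(graph, a, b):
    """Shortest-path distance between two concept nodes (BFS)."""
    queue, seen = deque([(a, 0)]), {a}
    while queue:
        node, dist = queue.popleft()
        if node == b:
            return dist
        for nxt in graph[node]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None

def path_similarity(taxonomy, a, b):
    """Path-distance similarity: 1 / (1 + shortest path length)."""
    dist = path_length(as_graph(taxonomy), a, b)
    return None if dist is None else 1 / (1 + dist)
```

In this toy taxonomy "car" and "bus" are two IS-A links apart (similarity 1/3), while "car" and "road" are three links apart (similarity 1/4), matching the quote's intuition that "car" is more similar to "bus" than to "road".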
- ↑ Harispe S.; Ranwez S.; Janaqi S.; Montmain J. (2015). “Semantic Similarity from Natural Language and Ontology Analysis". Synthesis Lectures on Human Language Technologies. 8:1: 1–254.
- ↑ Feng Y.; Bagheri E.; Ensan F.; Jovanovic J. (2017). “The state of the art in semantic relatedness: a framework for comparison". Knowledge Engineering Review. 32: 1–30. doi:10.1017/S0269888917000029.
2011
- (NLTK - WordNetCorpusReader Module, 2011-Jun-19) ⇒ http://nltk.googlecode.com/svn/trunk/doc/api/nltk.corpus.reader.wordnet.WordNetCorpusReader-class.html
2008
- (Milne & Witten, 2008b) ⇒ David N. Milne, and Ian H. Witten. (2008). “An Effective, Low-cost Measure of Semantic Relatedness Obtained from Wikipedia Links.” In: Proceedings of the AAAI 2008 Workshop on Wikipedia and Artificial Intelligence (WIKIAI 2008).
2007
- (Cilibrasi & Vitanyi, 2007) ⇒ Rudi L. Cilibrasi, and Paul M. B. Vitanyi. (2007). “The Google Similarity Distance.” In: IEEE Transactions on Knowledge and Data Engineering 19(3). doi:10.1109/TKDE.2007.48
- (Pedersen et al., 2007) ⇒ Ted Pedersen, Serguei V.S. Pakhomov, Siddharth Patwardhan, and Christopher G. Chute. (2007). “Measures of Semantic Similarity and Relatedness in the Biomedical Domain.” In: Journal of Biomedical Informatics.
- (Gabrilovich & Markovitch, 2007) ⇒ Evgeniy Gabrilovich, and Shaul Markovitch. (2007). “Computing Semantic Relatedness Using Wikipedia-based Explicit Semantic Analysis.” In: Proceedings of the 20th International Joint Conference on Artificial Intelligence (IJCAI 2007).
2006
- (Patwardhan & Pedersen, 2006) ⇒ Siddharth Patwardhan, and Ted Pedersen. (2006). “Using WordNet Based Context Vectors to Estimate the Semantic Relatedness of Concepts.” In: Proceedings of the EACL 2006 Workshop on Making Sense of Sense - Bringing Computational Linguistics and Psycholinguistics Together.
- (Strube & Ponzetto, 2006) ⇒ Michael Strube, and Simone P. Ponzetto. (2006). “WikiRelate! Computing Semantic Relatedness Using Wikipedia.” In: Proceedings of AAAI-06.
- (Budanitsky & Hirst, 2006) ⇒ Alexander Budanitsky, and Graeme Hirst. (2006). “Evaluating WordNet-based Measures of Lexical Semantic Relatedness.” In: Computational Linguistics Journal, 32(1). doi:10.1162/coli.2006.32.1.13
- QUOTE: The need to determine semantic relatedness or its inverse, semantic distance, between two lexically expressed concepts is a problem that pervades much of natural language processing.
2004
- (Pedersen et al., 2004) ⇒ Ted Pedersen, Siddharth Patwardhan, and Jason Michelizzi. (2004). “WordNet::Similarity - Measuring the Relatedness of Concepts.” In: Proceedings of the Nineteenth National Conference on Artificial Intelligence - Intelligent Systems Demonstration (AAAI-04).
2003
- (Patwardhan et al., 2003) ⇒ Siddharth Patwardhan, Satanjeev Banerjee, and Ted Pedersen. (2003). “Using Measures of Semantic Relatedness for Word Sense Disambiguation.” In: Proceedings of the Fourth International Conference on Intelligent Text Processing and Computational Linguistics (CICLing 2003).
- (Rodríguez & Egenhofer, 2003) ⇒ M. Andrea Rodríguez, and Max J. Egenhofer. (2003). “Determining Semantic Similarity among Entity Classes from Different Ontologies.” In: IEEE Transactions on Knowledge and Data Engineering 15(2).
2001
- (Turney, 2001) ⇒ Peter D. Turney. (2001). “Mining the Web for Synonyms: PMI-IR versus LSA on TOEFL.” In: Proceedings of the 12th European Conference on Machine Learning (ECML 2001). doi:10.1007/3-540-44795-4_42
- QUOTE: Various measures of semantic similarity between word pairs have been proposed, some using statistical (unsupervised learning from text) techniques [16, 17, 18], some using lexical databases (hand-built) [19, 20], and some hybrid approaches, combining statistics and lexical information [21, (Jiang & Conrath, 1997)]. Statistical techniques typically suffer from the sparse data problem: they perform poorly when the words are relatively rare, due to the scarcity of data. Hybrid approaches attempt to address this problem by supplementing sparse data with information from a lexical database [21, (Jiang & Conrath, 1997)].
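Turney's PMI-IR scores word pairs with pointwise mutual information estimated from search-engine hit counts. The sketch below computes the same PMI formula from hypothetical co-occurrence counts (all numbers are illustrative assumptions, not real corpus statistics):

```python
from math import log2

# Toy co-occurrence statistics standing in for PMI-IR's web hit counts.
total = 10000                                   # total contexts (hypothetical)
count = {"coffee": 120, "tea": 90, "moon": 80}  # single-word context counts
joint = {("coffee", "tea"): 40, ("coffee", "moon"): 1}  # co-occurrence counts

def pmi(w1, w2):
    """Pointwise mutual information: log2( p(w1, w2) / (p(w1) * p(w2)) )."""
    p1, p2 = count[w1] / total, count[w2] / total
    p12 = joint[(w1, w2)] / total
    return log2(p12 / (p1 * p2))
```

With these toy counts, `pmi("coffee", "tea")` is strongly positive while `pmi("coffee", "moon")` is near zero, illustrating how co-occurrence statistics separate related pairs from unrelated ones. Note the sparse-data problem the quote mentions: the reliability of the estimate collapses when the joint count is tiny.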
- (Budanitsky & Hirst, 2001) ⇒ Alexander Budanitsky, and Graeme Hirst. (2001). “Semantic Distance in WordNet: An experimental, application-oriented evaluation of five measures.” In: Proceedings of the Workshop on WordNet and Other Lexical Resources at NAACL 2001.
- Subject Headings: Lexical Semantic Similarity Measure, Resnik Similarity, Hirst — St-Onge Similarity, Leacock Chodorow Similarity, Jiang-Conrath Similarity, Lin Similarity.
- ABSTRACT: Five different proposed measures of similarity or semantic distance in WordNet were experimentally compared by examining their performance in a real-word spelling correction system. It was found that Jiang and Conrath’s measure gave the best results overall. That of Hirst and St-Onge seriously over-related, that of Resnik seriously under-related, and those of Lin and of Leacock and Chodorow fell in between.
1999
- (Resnik, 1999) ⇒ Philip Resnik. (1999). “Semantic Similarity in a Taxonomy: An Information-based Measure and its Application to Problems of Ambiguity in Natural Language.” In: Journal of Artificial Intelligence Research.
1998
- (Lin, 1998) ⇒ Dekang Lin. (1998). “An Information-Theoretic Definition of Similarity.” In: Proceedings of the 15th International Conference on Machine Learning (ICML 1998).
1997
- (Jiang & Conrath, 1997) ⇒ Jay J. Jiang, and David W. Conrath. (1997). “Semantic Similarity Based on Corpus Statistics and Lexical Taxonomy.” In: Proceedings on International Conference on Research in Computational Linguistics.
- (Schütze & Silverstein, 1997) ⇒ Hinrich Schütze, and Craig Silverstein. (1997). “Projections for Efficient Document Clustering.” In: ACM SIGIR Forum.
1995
- (Resnik, 1995) ⇒ Philip Resnik. (1995). “Using Information Content to Evaluate Semantic Similarity in a Taxonomy.” In: Proceedings of the 14th International Joint Conference on Artificial Intelligence (IJCAI 1995).
- QUOTE: This paper presents a new measure of semantic similarity in an IS-A taxonomy, based on the notion of information content.
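Resnik's information-content measure can be sketched as follows: IC(c) = -log p(c), and the similarity of two concepts is the IC of their most informative common subsumer in the IS-A taxonomy. The taxonomy and probabilities below are illustrative assumptions, not corpus-derived values:

```python
from math import log

# Hypothetical probabilities for concepts in a toy IS-A taxonomy;
# p(c) covers occurrences of c and all of its descendants.
prob = {"entity": 1.0, "beverage": 0.05, "coffee": 0.01,
        "tea": 0.01, "artifact": 0.3, "mug": 0.02}
parents = {"coffee": "beverage", "tea": "beverage", "beverage": "entity",
           "mug": "artifact", "artifact": "entity"}

def ancestors(c):
    """A concept plus all of its IS-A ancestors, up to the root."""
    chain = [c]
    while c in parents:
        c = parents[c]
        chain.append(c)
    return chain

def resnik_similarity(a, b):
    """IC of the most informative (lowest-probability) common subsumer."""
    common = set(ancestors(a)) & set(ancestors(b))
    return max(-log(prob[c]) for c in common)
```

Here "coffee" and "tea" share the informative subsumer "beverage" and score highly, while "coffee" and "mug" share only the root "entity" (p = 1.0, so IC = 0) and score zero: similarity depends on how specific the shared ancestor is, not merely on having one.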
1990
- (Deerwester et al., 1990) ⇒ Scott Deerwester, Susan T. Dumais, George W. Furnas, Thomas K. Landauer, and Richard Harshman. (1990). “Indexing by Latent Semantic Analysis.” In: Journal of the American Society for Information Science.
1989
- (Church & Hanks, 1989) ⇒ Kenneth W. Church, and Patrick Hanks. (1989). “Word Association Norms, Mutual Information and Lexicography.” In: Proceedings of the 27th Annual Meeting of the Association for Computational Linguistics (ACL 1989).