Document Similarity Metric Learning Task
A Document Similarity Metric Learning Task is a similarity metric learning task whose goal is to learn a document similarity scoring model (one that can assess the semantic similarity between two or more text documents).
- Context:
- Task Input: Paired or grouped text documents as training data (e.g. similar document pairs, dissimilar pairs).
- Task Output: A document similarity scoring model that can quantify the semantic similarity between new documents.
- It focuses on modeling semantic similarity rather than just lexical overlap.
- It can often be trained on relatively few labeled document pairs, unlike supervised classifiers, which typically need many labeled examples per class.
- It has applications in search, recommendation, and duplicate detection, among others.
- The model can generalize to new unseen documents based on learned notions of similarity.
- Performance is typically evaluated by how well model scores correlate with human judgments of similarity.
- Phrase-based approaches may outperform word-overlap methods.
- Information-theoretic metrics can also be used as document similarity measures.
- …
- Example(s):
- A Scientific Paper Similarity Metric Learning Task to learn a scientific paper similarity scoring model, for example based on citations between papers.
- A Legal Document Similarity Metric Learning Task to learn a legal document similarity model, for example based on relevant vs non-relevant document pairs.
- A Paraphrase Detection Similarity Metric Learning Task to learn a paraphrase detection model, for example based on original vs paraphrased document pairs.
- A Resume Screening Similarity Metric Learning Task to learn a resume screening model, for example based on job description-resume pairs.
- A Duplicate Bug Report Detection Task to learn a duplicate bug report scoring model, for example based on duplicate vs non-duplicate bug report pairs.
- A News Article Similarity Metric Learning Task to learn a news article similarity model, for example based on articles covering the same event.
- A Patient Record Similarity Metric Learning Task to learn a patient record similarity model, for example based on related vs unrelated medical records.
- A Source Code Similarity Metric Learning Task to learn a source code similarity model, for example based on related vs unrelated code snippet pairs.
- A Sentence Similarity Metric Learning Task to learn a sentence similarity scoring model, for example based on similar vs dissimilar sentence pairs.
- …
- Counter-Example(s):
- Text Classification Task - predicts categories rather than similarity.
- Text Generation Task - generates text rather than assessing similarity.
- Information Extraction Task - extracts structured information rather than learning similarity patterns.
- …
- See: Document Embedding Model, Document Distance Metric, Similarity Learning.
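The input/output contract described under Context (labeled document pairs in, a similarity scoring model out) can be illustrated with a toy sketch. It is not a method from the references below: it assumes bag-of-words features (a lexical simplification; a practical system would more likely use document embeddings), a logistic loss, and plain stochastic gradient descent. The names `tf_vector`, `pair_features`, `score`, `train`, and `TRAIN_PAIRS`, as well as the example sentences, are hypothetical.

```python
import math
from collections import Counter


def tf_vector(text):
    """L2-normalized term-frequency vector of a whitespace-tokenized document."""
    counts = Counter(text.lower().split())
    norm = math.sqrt(sum(c * c for c in counts.values()))
    return {t: c / norm for t, c in counts.items()} if norm else {}


def pair_features(doc_a, doc_b):
    """Per-term products for shared terms; their plain sum is cosine similarity."""
    a, b = tf_vector(doc_a), tf_vector(doc_b)
    return {t: a[t] * b[t] for t in a.keys() & b.keys()}


def score(weights, bias, doc_a, doc_b):
    """Learned similarity in (0, 1): sigmoid of a weighted term-overlap sum."""
    feats = pair_features(doc_a, doc_b)
    z = bias + sum(weights.get(t, 0.0) * v for t, v in feats.items())
    return 1.0 / (1.0 + math.exp(-z))


def train(pairs, epochs=200, lr=0.5):
    """Fit per-term weights and a bias by SGD on the logistic loss.

    pairs: list of (doc_a, doc_b, label); label 1 = similar, 0 = dissimilar.
    """
    weights, bias = {}, 0.0
    for _ in range(epochs):
        for doc_a, doc_b, label in pairs:
            feats = pair_features(doc_a, doc_b)
            error = score(weights, bias, doc_a, doc_b) - label
            for t, v in feats.items():
                weights[t] = weights.get(t, 0.0) - lr * error * v
            bias -= lr * error
    return weights, bias


# Hypothetical training data: one similar pair and one dissimilar pair.
TRAIN_PAIRS = [
    ("the court granted the appeal", "the appeal was granted by the court", 1),
    ("the court granted the appeal", "the recipe needs the flour", 0),
]

if __name__ == "__main__":
    weights, bias = train(TRAIN_PAIRS)
    for doc_a, doc_b, label in TRAIN_PAIRS:
        print(label, round(score(weights, bias, doc_a, doc_b), 3))
```

Both training pairs share the word "the", so an unweighted overlap measure would rate them about equally; the learned weights are what separate the similar pair from the dissimilar one. In line with the evaluation item above, such a model's scores would then be compared against human similarity judgments on held-out pairs.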
References
2009
- (Ahlgren & Colliander, 2009) ⇒ Per Ahlgren and Cecilia Colliander. (2009). “Document–document similarity approaches and science mapping: Experimental comparison of five approaches." In: Journal of Informetrics 3(1): 1-12.
- QUOTE: "This paper treats document–document similarity approaches in the context of science mapping. Five approaches, involving nine methods, are compared experimentally. We compare ..."
- NOTE: Compares multiple approaches to Document Similarity Metric Learning experimentally in the context of science mapping applications.
2008
- (Chim & Deng, 2008) ⇒ Hoi Chim and Xiaotie Deng. (2008). “Efficient phrase-based document similarity for clustering." In: IEEE Transactions on Knowledge and Data Engineering 20(9): 1217-1229.
- QUOTE: "... we propose a phrase-based document similarity measure called VSM-EPM ... our experiments indicate that the VSM-EPM similarity is very effective for clustering tasks."
- NOTE: Proposes and evaluates a phrase-based approach to Document Similarity Metric Learning that outperforms word overlap methods.
2005
- (Lee et al., 2005) ⇒ Michael D. Lee, Brandon Pincombe, and Matthew Welsh. (2005). “An empirical evaluation of models of text document similarity." In: Proceedings of the 27th Annual Conference of the Cognitive Science Society. pp. 1254-1259.
- QUOTE: "... we experimentally compared a number of recently developed models of text document similarity. These include ... We evaluated their performance at predicting human judgments of text document similarity."
- NOTE: Empirically compares and evaluates different approaches to Document Similarity Metric Learning on standard datasets.
2003
- (Aslam & Frost, 2003) ⇒ Javed A. Aslam and Matthias Frost. (2003). “An Information-Theoretic Measure for Document Similarity." In: Proceedings of the 26th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. pp. 449–450.
- QUOTE: "...test the effectiveness of an information-theoretic measure for pairwise document similarity. We adapt query retrieval to rate the quality of document similarity..."
- NOTE: Proposes and evaluates an information-theoretic approach to quantify Document Similarity for tasks like clustering and retrieval.