Cross-Lingual Text Mining
Cross-Lingual Text Mining is a category of text mining tasks for retrieving and accessing information from document collections written in several languages.
- AKA: CLTM.
- Context:
- It ranges from being a task based on Latent Semantic Analysis to being a Machine Translation Task.
- Example(s):
- See: Data Mining, Text Mining, Corpus, Ontology, Parallel Corpus, Comparable Corpus.
References
2011
- (Sammut & Webb, 2011) ⇒ Claude Sammut (editor), and Geoffrey I. Webb (editor). (2011). “Encyclopedia of Machine Learning.” Springer. ISBN: 0387307680
- QUOTE: (pg. 299): Definition: Cross-lingual text mining is a general category denoting tasks and methods for accessing the information in sets of documents written in several languages, or whenever the language used to express an information need is different from the language of the documents. A distinguishing feature of cross-lingual text mining is the necessity to overcome some language translation barrier. (...) A number of specific tasks fall under the term of Cross-lingual text mining (CLTM), including:
- These tasks can in principle be performed using methods which do not involve any Text Mining, but as a matter of fact all of them have been successfully approached relying on the statistical analysis of multilingual document collections, especially parallel corpora. While CLTM tasks differ in many respects, they are all characterized by the fact that they require to reliably measure the similarity of two text spans written in different languages. There are essentially two families of approaches for doing this:
- 1. In translation-based approaches one of the two text spans is first translated into the language of the other. Similarity is then computed based on any measure used in monolingual cases. As a variant, both text spans can be translated in a third pivot language.
- 2. In latent semantics approaches, an abstract vector space is defined based on the statistical properties of a parallel corpus (or, more rarely, of a comparable corpus). Both text spans are then represented as vectors in such latent semantic space, where any similarity measure for vector spaces can be used.
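As an illustration (not part of the quoted source), the latent semantics approach can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions: the four-sentence English/French parallel corpus is invented toy data, the function names (`fit_latent_space`, `embed`, `cosine`) are hypothetical, and the latent space is obtained with a plain truncated SVD of a merged bilingual term-by-document matrix, which is one common way to realize cross-lingual Latent Semantic Analysis rather than the method prescribed by the source.

```python
import numpy as np

# Toy parallel corpus (invented for illustration): each pair is an
# English sentence and its French translation.
PARALLEL = [
    ("cat sleeps", "chat dort"),
    ("cat eats", "chat mange"),
    ("dog barks", "chien aboie"),
    ("dog eats", "chien mange"),
]


def build_vocab(pairs):
    """Map language-tagged tokens (e.g. 'en:cat', 'fr:chat') to row indices."""
    vocab = {}
    for en, fr in pairs:
        tokens = [f"en:{w}" for w in en.split()] + [f"fr:{w}" for w in fr.split()]
        for tok in tokens:
            vocab.setdefault(tok, len(vocab))
    return vocab


def bow(text, lang, vocab):
    """Bag-of-words count vector for a text span; unknown words are ignored."""
    vec = np.zeros(len(vocab))
    for w in text.split():
        idx = vocab.get(f"{lang}:{w}")
        if idx is not None:
            vec[idx] += 1.0
    return vec


def fit_latent_space(pairs, k=2):
    """Build the latent space from the parallel corpus.

    Each column of X is one aligned pair treated as a single merged
    bilingual document; the top-k left singular vectors of X serve as
    the basis of the latent semantic space.
    """
    vocab = build_vocab(pairs)
    X = np.stack(
        [bow(en, "en", vocab) + bow(fr, "fr", vocab) for en, fr in pairs],
        axis=1,
    )
    U, _, _ = np.linalg.svd(X, full_matrices=False)
    return vocab, U[:, :k]


def embed(text, lang, vocab, basis):
    """Project a monolingual text span into the shared latent space."""
    return basis.T @ bow(text, lang, vocab)


def cosine(a, b):
    """Cosine similarity, returning 0.0 if either vector is all zeros."""
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom > 0 else 0.0


vocab, basis = fit_latent_space(PARALLEL, k=2)
query = embed("cat sleeps", "en", vocab, basis)
sim_match = cosine(query, embed("chat dort", "fr", vocab, basis))
sim_mismatch = cosine(query, embed("chien aboie", "fr", vocab, basis))
print(f"en 'cat sleeps' vs fr 'chat dort':   {sim_match:.3f}")
print(f"en 'cat sleeps' vs fr 'chien aboie': {sim_mismatch:.3f}")
```

In this toy setting the English query lands closer to its French translation than to the unrelated French sentence, even though the two spans share no surface vocabulary. A translation-based approach would instead translate one span first and then apply any monolingual similarity measure, such as cosine over bag-of-words vectors.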