Comparable Corpus

From GM-RKB

A Comparable Corpus is a collection of text corpora that share the same sampling frame, topic, or representation but whose texts are not translations of one another.



References

2017a

The criteria for defining the similarity between texts are not clearly established, but the aim of this type of corpus is to compare the languages or varieties presented in similar circumstances of communication, without the distortions that appear in the translated texts of Parallel Corpora.
Examples of comparable corpora are those mirrored on the Brown corpus of Standard American English, for example, the LOB Corpus (British English), and the Kolhapur Corpus (Indian English).
Within the ICE Project (International Corpus of English), twelve centres around the world are preparing corpora of their own national or regional variety of English. The first of these (ICE-GB) will be available from spring 1998.

2017b

  • (Wikipedia, 2017) ⇒ https://en.wikipedia.org/wiki/Text_corpus#Overview Retrieved:2017-5-28.
    • (...) In a comparable corpus, the texts are of the same kind and cover the same content, but they are not translations of each other. To exploit a parallel text, some kind of text alignment identifying equivalent text segments (phrases or sentences) is a prerequisite for analysis. Machine translation algorithms for translating between two languages are often trained using parallel fragments comprising a first language corpus and a second language corpus which is an element-for-element translation of the first language corpus. In order to make the corpora more useful for doing linguistic research, they are often subjected to a process known as annotation. An example of annotating a corpus is part-of-speech tagging, or POS-tagging, in which information about each word's part of speech (verb, noun, adjective, etc.) is added to the corpus in the form of tags. Another example is indicating the lemma (base) form of each word. When the language of the corpus is not a working language of the researchers who use it, interlinear glossing is used to make the annotation bilingual. Some corpora have further structured levels of analysis applied. In particular, a number of smaller corpora may be fully parsed. Such corpora are usually called Treebanks or Parsed Corpora. The difficulty of ensuring that the entire corpus is completely and consistently annotated means that these corpora are usually smaller, containing around one to three million words. Other levels of linguistic structured analysis are possible, including annotations for morphology, semantics and pragmatics. Corpora are the main knowledge base in corpus linguistics. The analysis and processing of various types of corpora are also the subject of much work in computational linguistics, speech recognition and machine translation, where they are often used to create hidden Markov models for part of speech tagging and other purposes. Corpora and frequency lists derived from them are useful for language teaching. 
Corpora can be considered as a type of foreign language writing aid as the contextualised grammatical knowledge acquired by non-native language users through exposure to authentic texts in corpora allows learners to grasp the manner of sentence formation in the target language, enabling effective writing.[1]
  1. Yoon, H., & Hirvela, A. (2004). ESL Student Attitudes toward Corpus Use in L2 Writing. Journal Of Second Language Writing, 13(4), 257–283. Retrieved 21 March 2012.

2017c

2014

(...) Parallel sentences may also be mined from comparable corpora such as news stories written on the same topic in different languages. Munteanu and Marcu (2002)[1] use suffix trees, and in later work log-likelihood ratios (Munteanu et al., 2004; Munteanu and Marcu, 2005), to detect parallel sentences.
Abdul-Rauf and Schwenk (2009); Rauf and Schwenk (2009); Rauf and Schwenk (2011) translate one side of the comparable corpus into the other language, use information retrieval methods to find matching sentences and use the TER metric to measure their similarity. Ștefănescu et al. (2012) report improvements with a more complex sentence similarity measure.
Instead of full sentences, parallel sentence fragments may be extracted from comparable corpora (Munteanu and Marcu, 2006). Methods have been proposed to extract matching phrases (Tanaka, 2002) or web pages (Smith, 2002) from such large collections. Quirk et al. (2007) propose a generative model for the same task.
Hewavitharana and Vogel (2011) extract phrase pairs from comparable corpora, using a classifier approach.
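The translate-retrieve-score pipeline described above can be illustrated with a minimal sketch. It assumes the source side has already been machine-translated into the target language, uses bag-of-words cosine similarity as a stand-in for the information retrieval step, and uses a word-level edit rate as a simplified substitute for TER (real TER also permits block shifts). The function names and the `max_rate` threshold are hypothetical and are not taken from any of the cited papers.

```python
import math
from collections import Counter


def tokenize(sentence):
    """Lowercase and split on whitespace (a deliberately naive tokenizer)."""
    return sentence.lower().split()


def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def word_edit_rate(hyp, ref):
    """Word-level Levenshtein distance divided by the reference length.

    Assumption: this is only a TER-like score; it has no shift operation.
    """
    prev = list(range(len(ref) + 1))
    for i, h in enumerate(hyp, start=1):
        cur = [i]
        for j, r in enumerate(ref, start=1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (h != r)))
        prev = cur
    return prev[-1] / max(len(ref), 1)


def mine_pairs(translated_source, target_sentences, max_rate=0.5):
    """For each translated source sentence, retrieve the most similar target
    sentence and keep the pair if its edit rate is at most max_rate.

    Returns a list of (source_index, target_index, edit_rate) tuples.
    """
    target_tokens = [tokenize(t) for t in target_sentences]
    target_bags = [Counter(t) for t in target_tokens]
    pairs = []
    for idx, sent in enumerate(translated_source):
        tokens = tokenize(sent)
        bag = Counter(tokens)
        best = max(
            range(len(target_bags)),
            key=lambda k: cosine(bag, target_bags[k]),
            default=None,
        )
        if best is None:  # no target sentences to search
            continue
        rate = word_edit_rate(tokens, target_tokens[best])
        if rate <= max_rate:
            pairs.append((idx, best, rate))
    return pairs
```

Calling `mine_pairs(translated, targets)` on two lists of sentences returns the index pairs that survive the threshold. A real system would replace the exhaustive cosine search with an inverted index and the edit rate with a full TER implementation.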

2011

2010

2007

  • (McEnery & Xiao, 2007) ⇒ McEnery, A., & Xiao, R. (2007). Parallel and comparable corpora: What is happening. Incorporating Corpora. The Linguist and the Translator, 18-31. http://core.ac.uk/download/pdf/71933.pdf
    • (...) In contrast, a comparable corpus can be defined as a corpus containing components that are collected using the same sampling frame and similar balance and representativeness (cf. McEnery, 2003: 450), e.g. the same proportions of the texts of the same genres in the same domains in a range of different languages in the same sampling period. However, the subcorpora of a comparable corpus are not translations of each other. Rather, their comparability lies in their same sampling frame and similar balance.
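One rough way to check the "same proportions of the texts of the same genres" condition is to compare the genre distributions of two subcorpora. The sketch below uses total variation distance for this. That measure is an assumption chosen for illustration, not one proposed by McEnery and Xiao, and the function names are hypothetical.

```python
from collections import Counter


def genre_proportions(genres):
    """Map each genre label to its share of the documents."""
    counts = Counter(genres)
    total = sum(counts.values())
    return {g: c / total for g, c in counts.items()}


def balance_distance(genres_a, genres_b):
    """Total variation distance between two genre distributions.

    0.0 means identical proportions; 1.0 means no genre in common.
    """
    pa = genre_proportions(genres_a)
    pb = genre_proportions(genres_b)
    labels = set(pa) | set(pb)
    return 0.5 * sum(abs(pa.get(g, 0.0) - pb.get(g, 0.0)) for g in labels)
```

Each argument is a list with one genre label per document, for example `balance_distance(["news", "fiction"], ["news", "news"])`. A value near zero suggests the two subcorpora are similarly balanced across genres. It says nothing about domain or sampling period, which would need separate checks.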


  1. Munteanu, Dragos Stefan and Marcu, Daniel (2002): Processing Comparable Corpora With Bilingual Suffix Trees, Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP) DOI:10.3115/1118693.1118730