Parallel Corpus
A Parallel Corpus is a text corpus that contains both source texts and their translations.
- AKA: Parallel Corpora, Parallel Text.
- Context:
- It can range from being a Bilingual Corpus to being a Multilingual Corpus.
- It can be an Aligned Parallel Corpus.
- Example(s):
- The Europarl, a parallel corpus extracted from the proceedings of the European Parliament [1].
- The United Nations Parallel Corpus.
- The ParaSol, a parallel corpus of Slavic and other languages [2].
- The English-Norwegian Parallel Corpus (ENPC).
- The English-German Translation Corpus.
- The English-Swedish Parallel Corpus (ESPC).
- The International Telecommunications Union Corpus (English-Spanish).
- The Intersect Parallel Corpus (English-French).
- The Multilingual Parallel Corpus (Danish, English, French, German, Greek, Italian, Finnish, Portuguese, Spanish, Swedish texts).
- …
- Counter-Example(s):
- A Comparable Corpus.
- A Monolingual Corpus.
- A non-parallel Translation Corpus.
- See: Text Corpus, Foreign Language Writing Aid, Machine Translation, Annotation, Part-of-Speech Tagging, Lemma (Morphology), Interlinear Gloss, Parsing, Treebank, Morphology (Linguistics), Semantics, Pragmatics, Corpus Linguistics.
References
2017a
- (Wikipedia, 2017) ⇒ https://en.wikipedia.org/wiki/Parallel_text Retrieved:2017-5-28.
- A parallel text is a text placed alongside its translation or translations. Parallel text alignment is the identification of the corresponding sentences in both halves of the parallel text. The Loeb Classical Library and the Clay Sanskrit Library are two examples of dual-language series of texts. Reference Bibles may contain the original languages and a translation, or several translations by themselves, for ease of comparison and study; Origen's Hexapla (Greek for "sixfold") placed six versions of the Old Testament side by side. The most famous example is the Rosetta Stone.
Large collections of parallel texts are called parallel corpora (see text corpus). Alignments of parallel corpora at sentence level are prerequisite for many areas of linguistic research.
During translation, sentences can be split, merged, deleted, inserted or reordered by the translator. This makes alignment a non-trivial task.
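The alignment difficulty described in the quote above can be illustrated with a minimal sketch. It is not the method of any cited work: it is a simplified length-based dynamic-programming aligner that assumes a sentence and its translation have roughly similar character lengths. The names `BEADS`, `PENALTY`, and `align`, and the penalty values, are hypothetical choices made for illustration. The sketch allows splits, merges, deletions, and insertions, but it does not handle reordering.

```python
from typing import List, Tuple

# Allowed alignment "beads": (source sentences consumed, target sentences consumed).
# 1-1 is a plain match, 1-0 and 0-1 model deletion and insertion,
# 2-1 and 1-2 model merging and splitting. Reordering is NOT handled.
BEADS = [(1, 1), (1, 0), (0, 1), (2, 1), (1, 2)]

# Hypothetical fixed penalties, chosen for illustration only.
PENALTY = {(1, 1): 0.0, (1, 0): 5.0, (0, 1): 5.0, (2, 1): 2.0, (1, 2): 2.0}


def align(src: List[str], tgt: List[str]) -> List[Tuple[List[str], List[str]]]:
    """Align two sentence lists by minimising total character-length mismatch."""
    n, m = len(src), len(tgt)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0

    # Forward pass: extend every reachable prefix pair by one bead.
    for i in range(n + 1):
        for j in range(m + 1):
            if cost[i][j] == inf:
                continue
            for di, dj in BEADS:
                ni, nj = i + di, j + dj
                if ni > n or nj > m:
                    continue
                s_len = sum(len(s) for s in src[i:ni])
                t_len = sum(len(t) for t in tgt[j:nj])
                c = cost[i][j] + abs(s_len - t_len) + PENALTY[(di, dj)]
                if c < cost[ni][nj]:
                    cost[ni][nj] = c
                    back[ni][nj] = (di, dj)

    # Backward pass: recover the cheapest sequence of beads.
    pairs = []
    i, j = n, m
    while (i, j) != (0, 0):
        di, dj = back[i][j]
        pairs.append((src[i - di:i], tgt[j - dj:j]))
        i, j = i - di, j - dj
    pairs.reverse()
    return pairs
```

Calling `align(["aaaa", "bbbbbbbb"], ["aaaa", "bbbb", "bbbb"])` pairs the first source sentence with the first target sentence and treats the second source sentence as split across the remaining two target sentences. Practical aligners typically replace the raw length difference with a probabilistic cost and often add lexical evidence.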
2017b
- (Sammut & Webb, 2017) ⇒ Claude Sammut, and Geoffrey I. Webb. (2017). "Parallel Corpus". In: (Sammut & Webb, 2017).
- QUOTE: A parallel corpus (pl. corpora) is a document collection composed of two or more disjoint subsets, each written in a different language, such that documents in each subset are translations of documents in each other subset. Moreover, it is required that the translation relation is known, i.e., that given a document in one of the subset (i.e., languages), it is known what documents in the other subset are its translations. The statistical analysis of parallel corpora is at the heart of most methods for cross-language text mining.
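The definition quoted above (disjoint per-language subsets plus a known translation relation) can be sketched as a small data structure. This is an illustrative assumption about one possible representation, not an interface from the cited entry; the class name `ParallelCorpus` and its methods are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set, Tuple

DocKey = Tuple[str, str]  # (language code, document id)


@dataclass
class ParallelCorpus:
    # One disjoint subset per language: language code -> {document id -> text}.
    subsets: Dict[str, Dict[str, str]] = field(default_factory=dict)
    # The known translation relation, stored symmetrically.
    links: Dict[DocKey, Set[DocKey]] = field(default_factory=dict)

    def add_document(self, lang: str, doc_id: str, text: str) -> None:
        """Add a document to the subset for its language."""
        self.subsets.setdefault(lang, {})[doc_id] = text

    def link(self, a: DocKey, b: DocKey) -> None:
        """Record that documents a and b are translations of each other."""
        for lang, doc_id in (a, b):
            if doc_id not in self.subsets.get(lang, {}):
                raise KeyError(f"unknown document: {lang}/{doc_id}")
        if a[0] == b[0]:
            raise ValueError("translations must be in different languages")
        self.links.setdefault(a, set()).add(b)
        self.links.setdefault(b, set()).add(a)

    def translations(self, lang: str, doc_id: str, target_lang: str) -> List[str]:
        """Return the texts in target_lang linked to the given document."""
        linked = self.links.get((lang, doc_id), set())
        return [self.subsets[l][d] for (l, d) in sorted(linked) if l == target_lang]


corpus = ParallelCorpus()
corpus.add_document("en", "d1", "Hello")
corpus.add_document("fr", "d1", "Bonjour")
corpus.link(("en", "d1"), ("fr", "d1"))
print(corpus.translations("en", "d1", "fr"))
```

Storing the links explicitly is what separates this structure from a comparable corpus, where documents in different languages cover similar material but the translation relation is not known.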
2007
- (McEnery & Xiao, 2007) ⇒ McEnery, A., & Xiao, R. (2007). "Parallel and comparable corpora: What is happening?" In: Incorporating Corpora: The Linguist and the Translator, pp. 18-31. http://core.ac.uk/download/pdf/71933.pdf
- (...) A parallel corpus can be defined as a corpus that contains source texts and their translations. Parallel corpora can be bilingual or multilingual. They can be uni-directional (e.g. from English into Chinese or from Chinese into English alone), bi-directional (e.g. containing both English source texts with their Chinese translations as well as Chinese source texts with their English translations), or multi-directional (e.g. the same piece of writing with English, French and German versions). In this sense, texts which are produced simultaneously in different languages (e.g. EU and UN regulations) also belong to the category of parallel corpora (cf. Hunston, 2002: 15).
2005
- (Koehn, 2005) ⇒ Koehn, P. (2005, September). Europarl: A parallel corpus for statistical machine translation. In MT summit (Vol. 5, pp. 79-86) [3].
- Abstract: We collected a corpus of parallel text in 11 languages from the proceedings of the European Parliament, which are published on the web. This corpus has found widespread use in the NLP community. Here, we focus on its acquisition and its application as training data for statistical machine translation (SMT). We trained SMT systems for 110 language pairs, which reveal interesting clues into the challenges ahead.
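The figure of 110 in the abstract above is consistent with counting every ordered (source, target) pair among 11 languages, since 11 × 10 = 110. This reading is inferred from the two numbers rather than stated in the quoted text. The language labels in the sketch below are hypothetical placeholders, because the abstract gives only the count.

```python
from itertools import permutations

# Placeholder labels: the abstract gives only the number of languages (11),
# so these names are hypothetical stand-ins.
languages = [f"L{k}" for k in range(1, 12)]

# Each ordered (source, target) pair is a separate translation direction.
directed_pairs = list(permutations(languages, 2))

print(len(languages))
print(len(directed_pairs))
```

Counting unordered pairs instead would give 55, so the 110 figure suggests that each translation direction was counted as a separate system.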
1997
- (Fung & McKeown, 1997) ⇒ Pascale Fung, and Kathleen R. McKeown. (1997). “A Technical Word- and Term-Translation Aid Using Noisy Parallel Corpora across Language Groups.” In: Journal Machine Translation, 12(1-2). doi:10.1023/A:1007974605290
- … This paper describes an algorithm for translating technical words and terms from noisy parallel corpora across language groups.