Europarl Corpus
A Europarl Corpus is a Parallel Corpus that is extracted from the proceedings of the European Parliament and is used for Statistical Machine Translation.
- AKA: European Parliament Proceedings Parallel Corpus.
- Context:
- Website: http://www.statmt.org/europarl/
- It was initially built by Koehn (2005).
- Example(s):
- Counter-Example(s):
- See: SemEval-2017 Task 2, LREC 2012, Church and Gale Algorithm.
References
2021
- (Europarl, 2021) ⇒ http://www.statmt.org/europarl/ Retrieved:2021-07-25.
- QUOTE: The Europarl parallel corpus is extracted from the proceedings of the European Parliament. It includes versions in 21 European languages: Romanic (French, Italian, Spanish, Portuguese, Romanian), Germanic (English, Dutch, German, Danish, Swedish), Slavik (Bulgarian, Czech, Polish, Slovak, Slovene), Finni-Ugric (Finnish, Hungarian, Estonian), Baltic (Latvian, Lithuanian), and Greek.
The goal of the extraction and processing was to generate sentence aligned text for statistical machine translation systems. For this purpose we extracted matching items and labeled them with corresponding document IDs. Using a preprocessor we identified sentence boundaries. We sentence aligned the data using a tool based on the Church and Gale algorithm.
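The sentence alignment step mentioned in the quote above can be illustrated with a minimal Python sketch of length-based alignment. It is not the tool used to build Europarl: the move set, the fixed penalties in `MOVES`, and the relative-length cost in `length_cost` are simplifying assumptions chosen for illustration, whereas the published algorithm relies on a probabilistic model of sentence lengths. The sketch only shows the general idea of aligning sentences by dynamic programming over character lengths.

```python
from typing import List, Tuple

# Allowed alignment moves: (source sentences consumed, target sentences consumed).
# The penalties are illustrative assumptions, not values used for Europarl.
MOVES = {(1, 1): 0.0, (1, 0): 4.0, (0, 1): 4.0, (2, 1): 1.5, (1, 2): 1.5}


def length_cost(src_len: int, tgt_len: int) -> float:
    """Relative difference in character length, between 0 and 1."""
    longest = max(src_len, tgt_len)
    return abs(src_len - tgt_len) / longest if longest else 0.0


def align(src: List[str], tgt: List[str]) -> List[Tuple[List[str], List[str]]]:
    """Return the cheapest sequence of aligned sentence groups."""
    n, m = len(src), len(tgt)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0

    # Forward dynamic programming over prefixes of both sentence lists.
    for i in range(n + 1):
        for j in range(m + 1):
            if cost[i][j] == inf:
                continue
            for (di, dj), penalty in MOVES.items():
                ni, nj = i + di, j + dj
                if ni > n or nj > m:
                    continue
                s_len = sum(len(s) for s in src[i:ni])
                t_len = sum(len(t) for t in tgt[j:nj])
                candidate = cost[i][j] + penalty + length_cost(s_len, t_len)
                if candidate < cost[ni][nj]:
                    cost[ni][nj] = candidate
                    back[ni][nj] = (i, j)

    # Trace back from the end to recover the aligned groups in order.
    pairs = []
    i, j = n, m
    while (i, j) != (0, 0):
        pi, pj = back[i][j]
        pairs.append((src[pi:i], tgt[pj:j]))
        i, j = pi, pj
    pairs.reverse()
    return pairs


if __name__ == "__main__":
    english = ["Resumption of the session", "I declare the session resumed."]
    french = ["Reprise de la session", "Je déclare reprise la session."]
    for src_group, tgt_group in align(english, french):
        print(src_group, "<->", tgt_group)
```

Each returned pair holds the source sentences and the target sentences judged to correspond, so a 2-to-1 group signals that two source sentences were translated as one target sentence.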
2012
- (Tiedemann, 2012) ⇒ Jörg Tiedemann. (2012). “Parallel Data, Tools and Interfaces in OPUS.” In: Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC 2012).
2005
- (Koehn, 2005) ⇒ Philipp Koehn. (2005). “Europarl: A Parallel Corpus for Statistical Machine Translation.” In: MT Summit 2005.
- QUOTE: We collected a corpus of parallel text in 11 languages from the proceedings of the European Parliament (...)