GermEval 2014 Dataset
A GermEval 2014 Dataset is a dataset for German Named Entity Recognition.
- AKA: GermEval 2014 NER Dataset.
- Context:
- It is the dataset on which the GermEval 2014 Named Entity Recognition Shared Task was built.
- …
- Example(s):
- Counter-Example(s):
- See: Annotation Task, Word Embedding, NoSta-D, Bidirectional LSTM-CNN-CRF Training System.
References
2018
- (GermEval 2014 NER, 2018) ⇒ https://sites.google.com/site/germeval2014ner/data Retrieved:2018-08-12
- QUOTE: The GermEval 2014 NER Shared Task builds on a new dataset with German Named Entity annotation (Benikova, Biemann & Reznicek, 2014) with the following properties:
- The data was sampled from German Wikipedia and News Corpora as a collection of citations.
- The dataset covers over 31,000 sentences corresponding to over 590,000 tokens.
- The NER annotation uses the NoSta-D guidelines, which extend the Tübingen Treebank guidelines, using four main NER categories with sub-structure, and annotating embeddings among NEs such as [ORG FC Kickers [LOC Darmstadt]] (...)
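The nested bracket notation quoted above, [ORG FC Kickers [LOC Darmstadt]], can be unpacked into one tag sequence per nesting level. The following is a minimal sketch, not the shared task's own tooling: the function names `parse_nested` and `to_bio_layers` are hypothetical, and the choice of two BIO-tagged layers (outer and inner) is an assumption made here for illustration.

```python
def parse_nested(text):
    """Split bracketed annotation into tokens and (label, start, end, depth) spans.

    Hypothetical helper: assumes every entity opens with "[LABEL" and
    closes with "]", as in "[ORG FC Kickers [LOC Darmstadt]]".
    """
    tokens = []
    spans = []
    stack = []  # currently open entities as (label, index of first token)
    # Pad closing brackets with spaces so they become separate pieces.
    for piece in text.replace("]", " ] ").split():
        if piece.startswith("["):
            stack.append((piece[1:], len(tokens)))
        elif piece == "]":
            label, start = stack.pop()
            # Depth 0 is the outermost entity, depth 1 is nested inside it.
            spans.append((label, start, len(tokens), len(stack)))
        else:
            tokens.append(piece)
    return tokens, spans


def to_bio_layers(tokens, spans, n_layers=2):
    """Turn spans into one BIO tag list per nesting depth (assumed: 2 layers)."""
    layers = [["O"] * len(tokens) for _ in range(n_layers)]
    for label, start, end, depth in spans:
        if depth >= n_layers:
            continue  # deeper nesting is dropped in this sketch
        layers[depth][start] = "B-" + label
        for i in range(start + 1, end):
            layers[depth][i] = "I-" + label
    return layers


tokens, spans = parse_nested("[ORG FC Kickers [LOC Darmstadt]]")
outer, inner = to_bio_layers(tokens, spans)
for token, outer_tag, inner_tag in zip(tokens, outer, inner):
    print(token, outer_tag, inner_tag, sep="\t")
# FC         B-ORG  O
# Kickers    I-ORG  O
# Darmstadt  I-ORG  B-LOC
```

Keeping one tag list per depth lets a standard sequence tagger be trained on each layer separately, at the cost of ignoring any nesting deeper than the number of layers chosen.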
2014
- (Benikova, Biemann & Reznicek, 2014) ⇒ Darina Benikova, Chris Biemann, and Marc Reznicek (2014, May). "NoSta-D Named Entity Annotation for German: Guidelines and Dataset." In LREC (pp. 2524-2531).
- ABSTRACT: We describe the annotation of a new dataset for German Named Entity Recognition (NER). The need for this dataset is motivated by licensing issues and consistency issues of existing datasets. We describe our approach to creating annotation guidelines based on linguistic and semantic considerations, and how we iteratively refined and tested them in the early stages of annotation in order to arrive at the largest publicly available dataset for German NER, consisting of over 31,000 manually annotated sentences (over 591,000 tokens) from German Wikipedia and German online news. We provide a number of statistics on the dataset, which indicate its high quality, and discuss legal aspects of distributing the data as a compilation of citations. The data is released under the permissive CC-BY license, and will be fully available for download in September 2014 after it has been used for the GermEval 2014 shared task on NER. We further provide the full annotation guidelines and links to the annotation tool used for the creation of this resource.