CoNLL 2000 Dataset
A CoNLL 2000 Dataset is a Text Chunking dataset developed for the CoNLL-2000 Shared Task.
- Example(s):
  - the CoNLL-2000 Training Dataset (WSJ sections 15-18), or
  - the CoNLL-2000 Test Dataset (WSJ section 20).
- Counter-Example(s):
  - a Named Entity Recognition dataset, such as a CoNLL-2003 Dataset.
- See: Annotation Task, Word Embedding, Bidirectional LSTM-CNN-CRF Training System.
References
2018
- (CoNLL 2000, 2018) ⇒ https://www.clips.uantwerpen.be/conll2000/chunking/ Retrieved:2018-08-12
- QUOTE: Text chunking consists of dividing a text in syntactically correlated parts of words. For example, the sentence He reckons the current account deficit will narrow to only # 1.8 billion in September. can be divided as follows:
[NP He] [VP reckons] [NP the current account deficit] [VP will narrow] [PP to] [NP only # 1.8 billion] [PP in] [NP September].
Text chunking is an intermediate step towards full parsing. It was the shared task for CoNLL-2000. Training and test data for this task is available. This data consists of the same partitions of the Wall Street Journal corpus (WSJ) as the widely used data for noun phrase chunking: sections 15-18 as training data (211727 tokens) and section 20 as test data (47377 tokens). The annotation of the data has been derived from the WSJ corpus by a program written by Sabine Buchholz from Tilburg University, The Netherlands.
The goal of this task is to come forward with machine learning methods which after a training phase can recognize the chunk segmentation of the test data as well as possible. The training data can be used for training the text chunker. The chunkers will be evaluated with the F rate, which is a combination of the precision and recall rates: F = 2*precision*recall / (recall+precision) [1]. The precision and recall numbers will be computed over all types of chunks.
- ↑ C.J. van Rijsbergen, "Information Retrieval". Butterworths, 1979.
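The data described in the quote is distributed in a one-token-per-line IOB format (token, part-of-speech tag, chunk tag). As a minimal sketch of working with it, the snippet below loads the corpus through NLTK's bundled copy; the nltk package and its "conll2000" corpus download are assumptions on top of the original distribution, not part of it.

```python
# Minimal sketch: loading the CoNLL-2000 chunking data via NLTK's bundled
# copy of the corpus (assumes `pip install nltk`; the corpus files are
# fetched by nltk.download below).
import nltk
from nltk.corpus import conll2000

nltk.download("conll2000")

# "train.txt" holds WSJ sections 15-18; "test.txt" holds section 20.
train_sents = conll2000.iob_sents("train.txt")
test_sents = conll2000.iob_sents("test.txt")

print(len(train_sents), "training sentences,", len(test_sents), "test sentences")
# Each sentence is a list of (token, POS tag, IOB chunk tag) triples, e.g.:
print(train_sents[0][:4])
```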
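The F rate quoted above combines precision and recall over whole chunks, not individual tags: a predicted chunk counts as correct only if its span and type both match a gold chunk. The sketch below is a simplified, illustrative stand-in for the official conlleval evaluation script; extract_chunks and chunk_f1 are hypothetical helper names, and the span extraction follows the usual IOB reading (a chunk ends at O, at a B- tag, or at an I- tag of a different type).

```python
# Hedged sketch: chunk-level precision, recall, and F rate from IOB tags,
# per the formula F = 2*precision*recall / (precision + recall).
# A simplified stand-in for the official conlleval script, not a replacement.

def extract_chunks(tags):
    """Return a set of (start, end, type) spans from a list of IOB tags."""
    chunks, start, ctype = set(), None, None
    for i, tag in enumerate(tags + ["O"]):  # sentinel flushes the last chunk
        ends_chunk = (
            tag == "O"
            or tag.startswith("B-")
            or (tag.startswith("I-") and tag[2:] != ctype)
        )
        if ends_chunk and start is not None:
            chunks.add((start, i, ctype))
            start, ctype = None, None
        if tag.startswith("B-") or (tag.startswith("I-") and start is None):
            start, ctype = i, tag[2:]
    return chunks

def chunk_f1(gold_tags, pred_tags):
    """Chunk-level F rate between gold and predicted IOB tag sequences."""
    gold, pred = extract_chunks(gold_tags), extract_chunks(pred_tags)
    correct = len(gold & pred)
    precision = correct / len(pred) if pred else 0.0
    recall = correct / len(gold) if gold else 0.0
    if precision + recall == 0.0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Toy example: 2 of 3 predicted chunks match gold, so P = R = 2/3, F ~ 0.667.
gold = ["B-NP", "I-NP", "B-VP", "O", "B-NP"]
pred = ["B-NP", "I-NP", "B-VP", "B-NP", "I-NP"]
print(round(chunk_f1(gold, pred), 3))
```

In the shared task itself, precision and recall are computed over all chunk types pooled together, which is why a single F rate summarizes a chunker's performance.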