CoNLL 2000 Dataset
A CoNLL 2000 Dataset is a Text Chunking dataset developed for the CoNLL-2000 Shared Task.
- Example(s):
  - the CoNLL-2000 Training Dataset (WSJ sections 15-18), or
  - the CoNLL-2000 Test Dataset (WSJ section 20).
- Counter-Example(s):
  - a Named Entity Recognition dataset, such as a CoNLL-2003 Dataset.
- See: Annotation Task, Word Embedding, Bidirectional LSTM-CNN-CRF Training System.
References
2018
- (CoNLL 2000, 2018) ⇒ https://www.clips.uantwerpen.be/conll2000/chunking/ Retrieved:2018-08-12
- QUOTE: Text chunking consists of dividing a text in syntactically correlated parts of words. For example, the sentence He reckons the current account deficit will narrow to only # 1.8 billion in September. can be divided as follows:
[NP He] [VP reckons] [NP the current account deficit] [VP will narrow] [PP to] [NP only # 1.8 billion] [PP in] [NP September].
Text chunking is an intermediate step towards full parsing. It was the shared task for CoNLL-2000. Training and test data for this task is available. This data consists of the same partitions of the Wall Street Journal corpus (WSJ) as the widely used data for noun phrase chunking: sections 15-18 as training data (211727 tokens) and section 20 as test data (47377 tokens). The annotation of the data has been derived from the WSJ corpus by a program written by Sabine Buchholz from Tilburg University, The Netherlands.
The goal of this task is to come forward with machine learning methods which after a training phase can recognize the chunk segmentation of the test data as well as possible. The training data can be used for training the text chunker. The chunkers will be evaluated with the F rate, which is a combination of the precision and recall rates: F = 2*precision*recall / (recall+precision) [1]. The precision and recall numbers will be computed over all types of chunks.
- ↑ C.J. van Rijsbergen, "Information Retrieval". Butterworths, 1979.
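The data described in the quote is distributed in a one-token-per-line IOB format (token, part-of-speech tag, chunk tag). As a minimal sketch of working with it, the snippet below loads the corpus through NLTK's bundled copy; the nltk package and its "conll2000" corpus download are assumptions on top of the original distribution, not part of it.

```python
# Minimal sketch: loading the CoNLL-2000 chunking data via NLTK's bundled
# copy of the corpus (assumes `pip install nltk`; the corpus files are
# fetched by nltk.download below).
import nltk
from nltk.corpus import conll2000

nltk.download("conll2000")

# "train.txt" holds WSJ sections 15-18; "test.txt" holds section 20.
train_sents = conll2000.iob_sents("train.txt")
test_sents = conll2000.iob_sents("test.txt")

print(len(train_sents), "training sentences,", len(test_sents), "test sentences")
# Each sentence is a list of (token, POS tag, IOB chunk tag) triples, e.g.:
print(train_sents[0][:4])
```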
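The F rate quoted above combines precision and recall over whole chunks, not individual tags: a predicted chunk counts as correct only if its span and type both match a gold chunk. The sketch below is a simplified, illustrative stand-in for the official conlleval evaluation script; extract_chunks and chunk_f1 are hypothetical helper names, and the span extraction follows the usual IOB reading (a chunk ends at O, at a B- tag, or at an I- tag of a different type).

```python
# Hedged sketch: chunk-level precision, recall, and F rate from IOB tags,
# per the formula F = 2*precision*recall / (precision + recall).
# A simplified stand-in for the official conlleval script, not a replacement.

def extract_chunks(tags):
    """Return a set of (start, end, type) spans from a list of IOB tags."""
    chunks, start, ctype = set(), None, None
    for i, tag in enumerate(tags + ["O"]):  # sentinel flushes the last chunk
        ends_chunk = (
            tag == "O"
            or tag.startswith("B-")
            or (tag.startswith("I-") and tag[2:] != ctype)
        )
        if ends_chunk and start is not None:
            chunks.add((start, i, ctype))
            start, ctype = None, None
        if tag.startswith("B-") or (tag.startswith("I-") and start is None):
            start, ctype = i, tag[2:]
    return chunks

def chunk_f1(gold_tags, pred_tags):
    """Chunk-level F rate between gold and predicted IOB tag sequences."""
    gold, pred = extract_chunks(gold_tags), extract_chunks(pred_tags)
    correct = len(gold & pred)
    precision = correct / len(pred) if pred else 0.0
    recall = correct / len(gold) if gold else 0.0
    if precision + recall == 0.0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Toy example: 2 of 3 predicted chunks match gold, so P = R = 2/3, F ~ 0.667.
gold = ["B-NP", "I-NP", "B-VP", "O", "B-NP"]
pred = ["B-NP", "I-NP", "B-VP", "B-NP", "I-NP"]
print(round(chunk_f1(gold, pred), 3))
```

In the shared task itself, precision and recall are computed over all chunk types pooled together, which is why a single F rate summarizes a chunker's performance.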