CoNLL 2000 Dataset
Latest revision as of 01:41, 27 February 2024
A CoNLL 2000 Dataset is a Text Chunking dataset created for the CoNLL-2000 Shared Task.
- Example(s):
- Counter-Example(s):
- See: Annotation Task, Word Embedding, Bidirectional LSTM-CNN-CRF Training System.
References
2018
- (CoNLL 2000, 2018) ⇒ https://www.clips.uantwerpen.be/conll2000/chunking/ Retrieved:2018-08-12
- QUOTE: Text chunking consists of dividing a text in syntactically correlated parts of words. For example, the sentence He reckons the current account deficit will narrow to only # 1.8 billion in September. can be divided as follows:
[NP He] [VP reckons] [NP the current account deficit] [VP will narrow] [PP to] [NP only # 1.8 billion] [PP in ] [NP September].
Text chunking is an intermediate step towards full parsing. It was the shared task for CoNLL-2000. Training and test data for this task is available. This data consists of the same partitions of the Wall Street Journal corpus (WSJ) as the widely used data for noun phrase chunking: sections 15-18 as training data (211727 tokens) and section 20 as test data (47377 tokens). The annotation of the data has been derived from the WSJ corpus by a program written by Sabine Buchholz from Tilburg University, The Netherlands.
The goal of this task is to come forward with machine learning methods which after a training phase can recognize the chunk segmentation of the test data as well as possible. The training data can be used for training the text chunker. The chunkers will be evaluated with the F rate, which is a combination of the precision and recall rates: F = 2*precision*recall / (recall+precision) [1]. The precision and recall numbers will be computed over all types of chunks.
- ↑ C.J. van Rijsbergen, "Information Retrieval". Butterworths, 1979.
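The chunk-level evaluation described in the quote above can be sketched in a few lines of Python. This is a simplified, illustrative reimplementation under the assumption of IOB2-style chunk tags (B-NP, I-NP, O, ...); the shared task's official scorer is the conlleval script, and this sketch is not that program.

```python
# Illustrative chunk-level scoring in the spirit of the CoNLL-2000
# evaluation: precision and recall are counted over whole chunks,
# and F = 2*precision*recall / (precision + recall).

def extract_chunks(tags):
    """Turn IOB2-style tags (B-NP, I-NP, O, ...) into a set of
    (chunk_type, start, end) spans, with end exclusive."""
    chunks, start, ctype = [], None, None
    for i, tag in enumerate(tags):
        # A chunk ends at "O", at a new "B-", or when the I- type changes.
        if tag == "O" or tag.startswith("B-") or (
                tag.startswith("I-") and tag[2:] != ctype):
            if ctype is not None:
                chunks.append((ctype, start, i))
                start, ctype = None, None
        # A chunk starts at "B-" (or an orphan "I-", handled leniently).
        if tag != "O" and ctype is None:
            start, ctype = i, tag[2:]
    if ctype is not None:
        chunks.append((ctype, start, len(tags)))
    return set(chunks)

def chunk_f_score(gold_tags, pred_tags):
    """Precision, recall and F over chunks, as in the task description."""
    gold, pred = extract_chunks(gold_tags), extract_chunks(pred_tags)
    correct = len(gold & pred)
    p = correct / len(pred) if pred else 0.0
    r = correct / len(gold) if gold else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f

# The example sentence from the quote, tagged in IOB2 style:
# [NP He] [VP reckons] [NP the current account deficit] [VP will narrow]
# [PP to] [NP only # 1.8 billion] [PP in] [NP September] .
gold = ["B-NP", "B-VP", "B-NP", "I-NP", "I-NP", "I-NP", "B-VP", "I-VP",
        "B-PP", "B-NP", "I-NP", "I-NP", "I-NP", "B-PP", "B-NP", "O"]
print(chunk_f_score(gold, gold))  # a perfect chunker scores (1.0, 1.0, 1.0)
```

A predicted chunk counts as correct only when both its boundaries and its type match a gold chunk exactly, which is why chunk-level F is stricter than per-token tagging accuracy.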