CoNLL 2000 Dataset

A [[CoNLL 2000 Dataset]] is a [[Text Chunking]] dataset developed by the [[CoNLL 2000 Shared Task]].
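In the task's data files, each token sits on its own line as three space-separated columns (the word, its part-of-speech tag, and an IOB chunk tag), with blank lines between sentences. The short sketch below reads the two splits through the copy of the corpus bundled with NLTK; the <code>nltk</code> package and its <code>conll2000</code> download are assumptions of this illustration, not something described on this page.

<pre>
# Sketch: load the CoNLL-2000 chunking splits via NLTK's copy of the corpus.
# Assumes `pip install nltk`; the corpus itself is fetched with nltk.download.
import nltk
from nltk.corpus import conll2000

nltk.download("conll2000", quiet=True)

# train.txt = WSJ sections 15-18 (211727 tokens); test.txt = WSJ section 20 (47377 tokens).
train_iob = conll2000.iob_sents("train.txt")
test_iob = conll2000.iob_sents("test.txt")
print(len(train_iob), "training sentences;", len(test_iob), "test sentences")

# Each token is a (word, POS tag, IOB chunk tag) triple; the example sentence
# in the quote below would appear as ('He', 'PRP', 'B-NP'), ('reckons', 'VBZ', 'B-VP'), ...
print(train_iob[0][:5])   # first five tokens of the first training sentence

# Chunk trees are also available, optionally restricted to certain chunk types.
train_trees = conll2000.chunked_sents("train.txt", chunk_types=("NP", "VP", "PP"))
print(train_trees[0])
</pre>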
== References ==

=== 2018 ===
* (CoNLL 2000, 2018) ⇒ https://www.clips.uantwerpen.be/conll2000/chunking/ Retrieved:2018-08-12
** QUOTE: [[Text chunking]] consists of dividing a text into [[syntactically correlated parts of word]]s. For example, the sentence <i>He reckons the current account deficit will narrow to only # 1.8 billion in September.</i> can be divided as follows: <P> &#91;NP <span style="color:red">He</span>&#93; &#91;VP <span style="color:green">reckons</span>&#93; &#91;NP <span style="color:red">the current account deficit</span>&#93; &#91;VP <span style="color:green">will narrow</span>&#93; &#91;PP <span style="color:blue">to</span>&#93; &#91;NP <span style="color:red">only # 1.8 billion</span>&#93; &#91;PP <span style="color:blue">in</span>&#93; &#91;NP <span style="color:red">September</span>&#93;. <P> [[Text chunking]] is an [[intermediate step]] towards full [[parsing]]. It was the [[CoNLL-2000 Shared Task|shared task for CoNLL-2000]]. [[Training Dataset|Training]] and [[test data]] for [[CoNLL-2000 Shared Task|this task]] are available. This [[data]] consists of the same [[partition]]s of the [[Wall Street Journal corpus (WSJ)]] as the widely used [[data]] for [[noun phrase chunking]]: sections 15-18 as [[training data]] (211727 [[token]]s) and section 20 as [[test data]] (47377 [[token]]s). The [[annotation]] of the [[data]] has been derived from the [[WSJ corpus]] by a [[program]] written by [[Sabine Buchholz]] from [[Tilburg University, The Netherlands]]. <P> The goal of [[CoNLL-2000 Shared Task|this task]] is to come forward with [[machine learning method]]s which, after a [[training phase]], can recognize the [[chunk segmentation]] of the [[test data]] as well as possible. The [[training data]] can be used for [[training]] the [[text chunker]]. The [[chunker]]s will be evaluated with the [[F rate]], which is a combination of the [[precision]] and [[recall rate]]s: F = 2*precision*recall / (recall+precision) <ref name="Rij79">C.J. van Rijsbergen, "Information Retrieval". Butterworths, 1979.</ref>. The [[precision]] and [[recall]] numbers will be computed over all types of chunks.
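The F rate quoted above is the balanced F-measure, F = 2*precision*recall / (recall+precision), with precision and recall computed over chunks of every type; a predicted chunk is normally counted as correct only when its span and type exactly match a gold chunk. The following is a minimal scoring sketch under that reading; the helper functions are illustrative and are not the shared task's own evaluation script.

<pre>
# Sketch: chunk-level precision, recall, and F over all chunk types,
# following F = 2 * precision * recall / (recall + precision).
# Input: parallel per-token IOB chunk tags such as "B-NP", "I-NP", "O".

def iob_to_chunks(tags):
    """Convert an IOB tag sequence into a set of (start, end, type) spans."""
    chunks, start, ctype = set(), None, None
    for i, tag in enumerate(list(tags) + ["O"]):          # sentinel closes a trailing chunk
        if ctype is not None and (tag == "O" or tag.startswith("B-") or tag != "I-" + ctype):
            chunks.add((start, i, ctype))
            start, ctype = None, None
        if tag.startswith("B-"):
            start, ctype = i, tag[2:]
        elif tag.startswith("I-") and ctype is None:       # tolerate chunks opened with I-
            start, ctype = i, tag[2:]
    return chunks

def chunk_fscore(gold_tags, pred_tags):
    gold, pred = iob_to_chunks(gold_tags), iob_to_chunks(pred_tags)
    correct = len(gold & pred)                             # exact span + type matches
    precision = correct / len(pred) if pred else 0.0
    recall = correct / len(gold) if gold else 0.0
    f = 2 * precision * recall / (recall + precision) if correct else 0.0
    return precision, recall, f

# Toy example: the prediction truncates one NP chunk, so 4 of 5 chunks match.
gold = ["B-NP", "B-VP", "B-NP", "I-NP", "I-NP", "I-NP", "B-VP", "I-VP", "B-PP"]
pred = ["B-NP", "B-VP", "B-NP", "I-NP", "I-NP", "O",    "B-VP", "I-VP", "B-PP"]
print(chunk_fscore(gold, pred))                            # 4/5 chunks match -> P = R = F = 0.8
</pre>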


----

1. C.J. van Rijsbergen, "Information Retrieval". Butterworths, 1979.