There is no Data Like More Data Heuristic

From GM-RKB

The "There is no Data Like More Data" Heuristic is a Heuristic from the NLP Community which holds that increasing the amount of training data tends to improve system performance.



References

2006

2002

  • (Van den Bosch & Buchholz, 2002) ⇒ Antal van den Bosch, and Sabine Buchholz. (2002). “Shallow Parsing on the Basis of Words Only: A Case Study.”
    • They train models on input features of differing granularity and informational richness.
    • ABSTRACT: A memory-based learning algorithm is trained to simultaneously chunk sentences and assign grammatical function tags to these chunks. We compare the algorithm’s performance on this parsing task with varying training set sizes (yielding learning curves) and different input representations. In particular we compare input consisting of words only, a variant that includes word form information for low-frequency words, gold-standard POS only, and combinations of these. The word-based shallow parser displays an apparently log-linear increase in performance, and surpasses the flatter POS-based curve at about 50,000 sentences of training data. The low-frequency variant performs even better, and the combinations is best. Comparative experiments with a real POS tagger produce lower results. We argue that we might not need an explicit intermediate POS-tagging step for parsing when a sufficient amount of training material is available and word form information is used for low-frequency words.
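
The abstract above describes learning curves that grow log-linearly with training set size, with a steeper curve eventually overtaking a flatter one. A minimal sketch of that idea follows, assuming each curve has the form score = a + b * ln(size). The function names and every data point are hypothetical and chosen only for illustration; none of the numbers come from the paper.

```python
import math


def fit_log_linear(sizes, scores):
    """Least-squares fit of score = a + b * ln(size); returns (a, b)."""
    xs = [math.log(n) for n in sizes]
    count = len(xs)
    mean_x = sum(xs) / count
    mean_y = sum(scores) / count
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, scores))
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b


def crossover_size(curve_1, curve_2):
    """Training set size where two log-linear curves meet, or None if parallel."""
    a1, b1 = curve_1
    a2, b2 = curve_2
    if b1 == b2:
        return None
    return math.exp((a2 - a1) / (b1 - b2))


# Made-up illustrative scores (NOT results from the paper).
sizes = [1_000, 10_000, 100_000]
steep_scores = [50.0, 60.0, 70.0]  # gains 10 points per tenfold increase in data
flat_scores = [58.0, 62.0, 66.0]   # gains 4 points per tenfold increase in data

steep_fit = fit_log_linear(sizes, steep_scores)
flat_fit = fit_log_linear(sizes, flat_scores)
crossing = crossover_size(steep_fit, flat_fit)
print(f"Curves cross at about {crossing:,.0f} training items")
```

With real learning-curve measurements in place of the invented points, the same fit gives a rough estimate of how much data the steeper representation needs before it overtakes the flatter one, which is the kind of comparison the heuristic appeals to.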

2001