Language Model-based System Evaluation Task

From GM-RKB

A Language Model-based System Evaluation Task is an NLP evaluation task for an LM-based system (a system that uses a language model to perform an NLP task).

Example(s):
- MMLU (Measuring Massive Multitask Language Understanding).
- ...
See: Language Modeling, AI Evaluation Task.

References

2023

(Wikipedia, 2023) ⇒ https://en.wikipedia.org/wiki/Language_model#Evaluation_and_benchmarks Retrieved:2023-5-8.
- The quality of language models is mostly evaluated by comparison against human-created benchmarks built from typical language-oriented tasks. Other, less established quality tests examine the intrinsic character of a language model or compare two such models. Since language models are typically intended to be dynamic and to learn from the data they see, some proposed models investigate the rate of learning, e.g. through inspection of learning curves. Various datasets have been developed for evaluating language processing systems.
- These include:
  - Corpus of Linguistic Acceptability^[1]
  - GLUE benchmark^[2]
  - Microsoft Research Paraphrase Corpus^[3]
  - Multi-Genre Natural Language Inference.
  - Question Natural Language Inference.
  - Quora Question Pairs^[4]
  - Recognizing Textual Entailment^[5]
  - Semantic Textual Similarity Benchmark.
  - SQuAD question answering Test^[6]
  - Stanford Sentiment Treebank^[7]
  - Winograd NLI.
  - BoolQ, PIQA, SIQA, HellaSwag, WinoGrande, ARC, OpenBookQA, NaturalQuestions, TriviaQA, RACE, MMLU (Measuring Massive Multitask Language Understanding), BIG-bench Hard, GSM8K, RealToxicityPrompts, WinoGender, CrowS-Pairs.^[8] (LLaMA Benchmark)
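
Many of the benchmarks listed above score a system by comparing its outputs to human-created gold answers. The following is a minimal sketch of that comparison for multiple-choice items, not the official scoring code of any listed benchmark. The item format, the function `benchmark_accuracy`, and the `longest_choice_baseline` stand-in for an LM-based system are all hypothetical names introduced for illustration.

```python
from typing import Callable, List, Sequence, Tuple

# Hypothetical item format: (question, answer choices, index of the gold choice).
Item = Tuple[str, Sequence[str], int]


def benchmark_accuracy(
    predict: Callable[[str, Sequence[str]], int],
    items: Sequence[Item],
) -> float:
    """Return the fraction of items where the system picks the gold choice."""
    if not items:
        raise ValueError("benchmark has no items")
    correct = sum(
        1 for question, choices, gold in items
        if predict(question, choices) == gold
    )
    return correct / len(items)


def longest_choice_baseline(question: str, choices: Sequence[str]) -> int:
    """Toy stand-in for an LM-based system: always pick the longest choice."""
    return max(range(len(choices)), key=lambda i: len(choices[i]))


# Made-up toy items; a real benchmark would load its published test set.
toy_items: List[Item] = [
    ("2 + 2 = ?", ["3", "4", "22"], 1),
    ("Capital of France?", ["Paris", "Rome"], 0),
    ("Opposite of hot?", ["cold", "warm", "hotter"], 0),
]

score = benchmark_accuracy(longest_choice_baseline, toy_items)
print(f"accuracy = {score:.2f}")
```

In practice `predict` would wrap a call to the language model under evaluation, and the reported number would be compared against published baselines for the same benchmark.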

2019

https://openreview.net/forum?id=HJePno0cYm&noteId=Hkla0-dp27
- QUOTE: This paper proposes a variant of transformer to train language model, ... Extensive experiments in terms of perplexity results are reported, specially on WikiText-103 corpus, significant perplexity reduction has been achieved.
  Perplexity is not a gold standard for language model, the authors are encouraged to report experimental results on real world applications such as word [error] rate reduction [in] ASR [or] BLEU score improvement [in] machine translation.
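
For context on the intrinsic metric the quote refers to, perplexity is commonly computed as the exponential of the average negative log-probability a model assigns to each token of held-out text. The sketch below assumes the per-token natural-log probabilities are already available; the function name `perplexity` and the sample probabilities are illustrative assumptions, not values from the reviewed paper.

```python
import math
from typing import Sequence


def perplexity(token_log_probs: Sequence[float]) -> float:
    """Perplexity = exp(-(1/N) * sum of natural-log token probabilities)."""
    if not token_log_probs:
        raise ValueError("need at least one token")
    mean_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(mean_nll)


# Hypothetical probabilities a model assigned to each token of a 4-token text.
probs = [0.25, 0.5, 0.125, 0.25]
ppl = perplexity([math.log(p) for p in probs])
print(f"perplexity = {ppl:.2f}")

# Sanity check: a model that is uniform over 10 outcomes has perplexity 10.
uniform_ppl = perplexity([math.log(1 / 10)] * 5)
print(f"uniform perplexity = {uniform_ppl:.2f}")
```

Lower perplexity means the model found the text less surprising, but, as the reviewer notes, it does not by itself show improvement on a downstream application.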

[1] "The Corpus of Linguistic Acceptability (CoLA)". https://nyu-mll.github.io/CoLA/. Retrieved 2019-02-25.
[2] "GLUE Benchmark". https://gluebenchmark.com/. Retrieved 2019-02-25.
[3] "Microsoft Research Paraphrase Corpus". https://www.microsoft.com/en-us/download/details.aspx?id=52398. Retrieved 2019-02-25.
[4] Template:Citation
[5] Sammons, Mark; Vydiswaran, V.G. Vinod; Roth, Dan. "Recognizing Textual Entailment". http://l2r.cs.uiuc.edu/~danr/Teaching/CS546-12/TeChapter.pdf. Retrieved 2019-02-24.
[6] "The Stanford Question Answering Dataset". https://rajpurkar.github.io/SQuAD-explorer/. Retrieved 2019-02-25.
[7] "Recursive Deep Models for Semantic Compositionality Over a Sentiment Treebank". https://nlp.stanford.edu/sentiment/treebank.html. Retrieved 2019-02-25.
[8] Template:Citation
