Language Model-based System Evaluation Task
A Language Model-based System Evaluation Task is an NLP evaluation task that assesses an LM-based system (a system that applies a language model to an NLP task).
- Example(s):
- measuring a language model's perplexity on a held-out corpus such as WikiText-103.
- scoring an LM-based system against a benchmark such as GLUE, SQuAD, or MMLU.
- See: Language Modeling, AI Evaluation Task.
References
2023
- (Wikipedia, 2023) ⇒ https://en.wikipedia.org/wiki/Language_model#Evaluation_and_benchmarks Retrieved 2023-05-08.
- Evaluation of the quality of language models is mostly done by comparison to human-created sample benchmarks built from typical language-oriented tasks. Other, less established, quality tests examine the intrinsic character of a language model or compare two such models. Since language models are typically intended to be dynamic and to learn from the data they see, some proposed evaluations investigate the rate of learning, e.g. through inspection of learning curves. Various data sets have been developed for use in evaluating language-processing systems.
- These include:
- Corpus of Linguistic Acceptability[1]
- GLUE benchmark[2]
- Microsoft Research Paraphrase Corpus[3]
- Multi-Genre Natural Language Inference.
- Question Natural Language Inference.
- Quora Question Pairs[4]
- Recognizing Textual Entailment[5]
- Semantic Textual Similarity Benchmark.
- SQuAD question answering Test[6]
- Stanford Sentiment Treebank[7]
- Winograd NLI.
- BoolQ, PIQA, SIQA, HellaSwag, WinoGrande, ARC, OpenBookQA, NaturalQuestions, TriviaQA, RACE, MMLU (Measuring Massive Multitask Language Understanding), BIG-bench hard, GSM8k, RealToxicityPrompts, WinoGender, CrowS-Pairs.[8] (the LLaMA benchmark suite)
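Most of the benchmarks listed above are scored extrinsically: the system under test answers each labeled item and a task metric such as accuracy is computed against the gold answers. The following is a minimal Python sketch of that scoring loop under stated assumptions; the `score_choice` callable is a hypothetical stand-in for whatever interface the LM-based system exposes, not part of any benchmark's official harness.

```python
# Minimal sketch of extrinsic, benchmark-style scoring: the system
# under test picks one choice per labeled item and accuracy is
# computed against the gold answers. `score_choice` is a hypothetical
# stand-in for the model interface (an assumption, not a real API).
import string
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class MultipleChoiceItem:
    question: str
    choices: List[str]
    answer_index: int  # index of the gold choice

def evaluate_accuracy(items: List[MultipleChoiceItem],
                      score_choice: Callable[[str, str], float]) -> float:
    """Pick the highest-scoring choice per item; return mean accuracy."""
    correct = 0
    for item in items:
        scores = [score_choice(item.question, c) for c in item.choices]
        predicted = max(range(len(scores)), key=scores.__getitem__)
        correct += int(predicted == item.answer_index)
    return correct / len(items)

def toy_scorer(question: str, choice: str) -> float:
    """Toy stand-in scorer: count words shared by question and choice."""
    def words(s: str) -> set:
        return set(s.lower().translate(
            str.maketrans("", "", string.punctuation)).split())
    return float(len(words(question) & words(choice)))

if __name__ == "__main__":
    items = [MultipleChoiceItem("Is the sky blue on a clear day?",
                                ["The sky is blue.", "The sky is green."], 0)]
    print(f"accuracy = {evaluate_accuracy(items, toy_scorer):.2f}")
```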
2019
- https://openreview.net/forum?id=HJePno0cYm&noteId=Hkla0-dp27
- QUOTE: This paper proposes a variant of the transformer to train a language model, ... Extensive experiments in terms of perplexity results are reported; in particular, on the WikiText-103 corpus a significant perplexity reduction has been achieved.
Perplexity is not a gold standard for language models; the authors are encouraged to report experimental results on real-world applications, such as word error rate reduction in ASR or BLEU score improvement in machine translation.
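The review contrasts intrinsic evaluation (perplexity) with extrinsic, application-level metrics (word error rate for ASR, BLEU for machine translation). Below is a minimal Python sketch of two of these measures; it assumes per-token log-probabilities are already available from the model under test, and the numbers in the usage example are illustrative only.

```python
# Minimal sketch of two evaluation styles contrasted in the review:
# intrinsic perplexity from per-token log-probabilities, and an
# extrinsic metric, word error rate (WER), as used for ASR. The
# log-probabilities are assumed inputs from the model under test.
import math
from typing import List

def perplexity(token_log_probs: List[float]) -> float:
    """Perplexity = exp(-mean log-likelihood) over evaluation tokens."""
    return math.exp(-sum(token_log_probs) / len(token_log_probs))

def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + insertions + deletions) / reference length,
    via standard Levenshtein dynamic programming over words."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,          # deletion
                           dp[i][j - 1] + 1,          # insertion
                           dp[i - 1][j - 1] + cost)   # substitution
    return dp[len(ref)][len(hyp)] / len(ref)

if __name__ == "__main__":
    # Toy numbers only; real log-probs would come from scoring a
    # corpus such as WikiText-103 with the model under test.
    print(f"perplexity = {perplexity([-2.1, -0.3, -1.7]):.2f}")
    print(f"WER = {word_error_rate('the cat sat', 'the cat sat down'):.2f}")
```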
- ↑ "The Corpus of Linguistic Acceptability (CoLA)". https://nyu-mll.github.io/CoLA/. Retrieved 2019-02-25.
- ↑ "GLUE Benchmark" (in en). https://gluebenchmark.com/. Retrieved 2019-02-25.
- ↑ "Microsoft Research Paraphrase Corpus" (in en-us). https://www.microsoft.com/en-us/download/details.aspx?id=52398. Retrieved 2019-02-25.
- ↑ "First Quora Dataset Release: Question Pairs". https://data.quora.com/First-Quora-Dataset-Release-Question-Pairs.
- ↑ Sammons, Mark; Vydiswaran, V.G.Vinod; Roth, Dan. "Recognizing Textual Entailment". http://l2r.cs.uiuc.edu/~danr/Teaching/CS546-12/TeChapter.pdf. Retrieved 2019-02-24.
- ↑ "The Stanford Question Answering Dataset". https://rajpurkar.github.io/SQuAD-explorer/. Retrieved 2019-02-25.
- ↑ "Recursive Deep Models for Semantic Compositionality Over a Sentiment Treebank". https://nlp.stanford.edu/sentiment/treebank.html. Retrieved 2019-02-25.
- ↑ Touvron, Hugo; et al. (2023). "LLaMA: Open and Efficient Foundation Language Models". arXiv:2302.13971.