General Language Understanding Evaluation (GLUE) Benchmark
A General Language Understanding Evaluation (GLUE) Benchmark is an NLP Benchmark for training, evaluating and analyzing Natural Language Understanding systems.
- Context:
- Its leaderboard is available at: https://gluebenchmark.com/tasks
- It can range from being a GLUE Single Sentence Task, to being a GLUE Similarity and Paraphrasing Task, to being a GLUE Natural Language Inference Task.
- It can range from being a GLUE Single-Task Training Evaluation System, to being a GLUE Multi-Task Training Evaluation System, to being a GLUE Pre-Training Evaluation System, to being a GLUE Pre-Trained Sentence Representation Model Evaluation System.
- It can range from being a GLUE BiLSTM Sentence Encoder to being a Post-Attention BiLSTM System.
- It can use Natural Language Understanding Task Corpora and Datasets, such as:
- Corpus of Linguistic Acceptability (CoLA) (Warstadt et al., 2018);
- Stanford Sentiment Treebank - Version 2 (SST-2) (Socher et al., 2013);
- Microsoft Research Paraphrase Corpus (MRPC) (Dolan & Brockett, 2005);
- Quora Question Pairs (QQP) Dataset;
- Semantic Textual Similarity Benchmark (STS-B) (Cer et al., 2017);
- Multi-Genre Natural Language Inference (MNLI) Corpus (Williams et al., 2018);
- Question Natural Language Inference (QNLI) Dataset, derived from the Stanford Question Answering Dataset (SQuAD) (Rajpurkar et al., 2016);
- Recognizing Textual Entailment (RTE) Datasets;
- Winograd Natural Language Inference (WNLI) Dataset (Levesque et al., 2011).
- Example(s):
- GLUE BiLSTM+ELMO Single Task Training System,
- GLUE BiLSTM+ELMO Multi-Task Training System,
- GLUE BiLSTM+COVE Single Task Training System,
- GLUE BiLSTM+COVE Multi-Task Training System,
- GLUE Post-Attention BiLSTM+ELMO Single Task Training System,
- GLUE Post-Attention BiLSTM+ELMO Multi-Task Training System,
- GLUE Post-Attention BiLSTM+COVE Single Task Training System,
- GLUE Post-Attention BiLSTM+COVE Multi-Task Training System,
- GLUE Pre-Trained CBOW System,
- GLUE Pre-Trained Skip-Thought System,
- GLUE Pre-Trained InferSent System,
- GLUE Pre-Trained Dissent System,
- GLUE Pre-Trained GenSen System.
- Counter-Example(s):
- See: BERT System, OpenAI GPT System, Transfer Learning System, Natural Language Processing System, Natural Language Inference System, Deep Learning System, Machine Translation System, Artificial Neural Network.
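The task families and datasets listed in the Context above can be summarized as a small lookup table. The following is a minimal sketch, not an official GLUE artifact: the dictionary name `GLUE_TASKS`, the family labels, and the helper `tasks_by_family` are hypothetical names chosen for illustration, and the metric names follow the per-task metrics reported in (Wang et al., 2018).

```python
from collections import defaultdict

# Hypothetical registry: task name -> (task family, evaluation metrics).
# Families mirror the three task ranges in the Context list; metric names
# follow those reported per task in Wang et al. (2018).
GLUE_TASKS = {
    "CoLA":  ("single_sentence", ["matthews_corr"]),
    "SST-2": ("single_sentence", ["accuracy"]),
    "MRPC":  ("similarity_paraphrase", ["accuracy", "f1"]),
    "QQP":   ("similarity_paraphrase", ["accuracy", "f1"]),
    "STS-B": ("similarity_paraphrase", ["pearson", "spearman"]),
    "MNLI":  ("inference", ["accuracy"]),
    "QNLI":  ("inference", ["accuracy"]),
    "RTE":   ("inference", ["accuracy"]),
    "WNLI":  ("inference", ["accuracy"]),
}


def tasks_by_family(tasks=GLUE_TASKS):
    """Group task names by family, preserving the order of the registry."""
    grouped = defaultdict(list)
    for name, (family, _metrics) in tasks.items():
        grouped[family].append(name)
    return dict(grouped)


if __name__ == "__main__":
    for family, names in tasks_by_family().items():
        print(family, names)
```

Keeping the metrics next to the task name makes it easy to see that tasks in the same family are not always scored the same way (for example, STS-B is a regression task scored by correlation, while MRPC and QQP are classification tasks).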
References
2019a
- (Wang et al., 2019) ⇒ Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R. Bowman. (2019). “GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding.” In: Proceedings of the 7th International Conference on Learning Representations (ICLR 2019).
2019b
- (Glue Benchmark, 2019) ⇒ https://gluebenchmark.com/ Retrieved:2019-09-14.
- QUOTE: The General Language Understanding Evaluation (GLUE) benchmark is a collection of resources for training, evaluating, and analyzing natural language understanding systems. GLUE consists of:
- A benchmark of nine sentence- or sentence-pair language understanding tasks built on established existing datasets and selected to cover a diverse range of dataset sizes, text genres, and degrees of difficulty,
- A diagnostic dataset designed to evaluate and analyze model performance with respect to a wide range of linguistic phenomena found in natural language, and
- A public leaderboard for tracking performance on the benchmark and a dashboard for visualizing the performance of models on the diagnostic set.
- The format of the GLUE benchmark is model-agnostic, so any system capable of processing sentence and sentence pairs and producing corresponding predictions is eligible to participate. The benchmark tasks are selected so as to favor models that share information across tasks using parameter sharing or other transfer learning techniques. The ultimate goal of GLUE is to drive research in the development of general and robust natural language understanding systems.
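The model-agnostic format described in the quote above can be illustrated with a short sketch. Everything here is an assumption made for illustration: the `Predictor` type, the `accuracy` and `macro_average` helpers, the `always_entailment` baseline, and the toy sentence pairs are hypothetical and are not part of the GLUE distribution or its submission tooling. The averaging step reflects the macro-average over tasks described in (Wang et al., 2018), in simplified form.

```python
from typing import Callable, Dict, Optional, Sequence, Tuple

# Hypothetical interface: a "system" is any callable that maps a sentence
# (and optionally a second sentence) to a predicted label.
Predictor = Callable[[str, Optional[str]], str]

# (sentence1, sentence2 or None, gold label)
Example = Tuple[str, Optional[str], str]


def accuracy(predict: Predictor, examples: Sequence[Example]) -> float:
    """Fraction of examples whose predicted label matches the gold label."""
    correct = sum(predict(s1, s2) == gold for s1, s2, gold in examples)
    return correct / len(examples)


def macro_average(task_scores: Dict[str, float]) -> float:
    """Unweighted mean of per-task scores (each task counts equally)."""
    return sum(task_scores.values()) / len(task_scores)


def always_entailment(sentence1: str, sentence2: Optional[str] = None) -> str:
    """Trivial baseline that ignores its input; stands in for a real model."""
    return "entailment"


# Toy sentence pairs invented for this sketch; they are not GLUE data.
toy_pairs = [
    ("A man is sleeping.", "A person is asleep.", "entailment"),
    ("A man is sleeping.", "A man is running.", "not_entailment"),
]

if __name__ == "__main__":
    print(accuracy(always_entailment, toy_pairs))
    print(macro_average({"task_a": 0.5, "task_b": 1.0}))
```

Because the evaluation code only calls `predict(sentence1, sentence2)`, the same loop works for a bag-of-words baseline, a BiLSTM encoder, or a pre-trained sentence representation model, which is the sense in which the benchmark places no constraint on model architecture.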
2019c
- (Glue Benchmark, 2019) ⇒ https://gluebenchmark.com/ Retrieved:2019-09-14.
- QUOTE: The GLUE benchmark comes with a manually-curated evaluation dataset for fine-grained analysis of system performance on a broad range of linguistic phenomena.
This dataset evaluates sentence understanding through Natural Language Inference (NLI) problems. The NLI task is well-suited to our purposes because it can encompass a large set of skills involved in language understanding, from resolving syntactic ambiguity to high-level reasoning, while still supporting a straightforward evaluation(...)
2018
- (Wang et al., 2018) ⇒ Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel Bowman. (2018). “GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding.” In: Proceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP. doi:10.18653/v1/W18-5446 arXiv:1804.07461