General Language Understanding Evaluation (GLUE) Benchmark

AKA: GLUE Benchmark, General Language Understanding Evaluation.
Context:
- Task Input: Sentence pairs (e.g., premise and hypothesis).
- Optional Input: metadata or instructions (e.g., entailment classification).
- Task Output: Predicted label (e.g., entailment, contradiction, sentiment).
- Task Performance Measure/Metrics: Accuracy, F1 score, Matthews Correlation Coefficient.
- Its leaderboard is available at: https://gluebenchmark.com/tasks
- It can range from being a GLUE Single Sentence Task, to being a GLUE Similarity and Paraphrasing Task, to being a GLUE Natural Language Inference Task.
- It can range from being a GLUE Single-Task Training Evaluation System, to being a GLUE Multi-Task Training Evaluation System, to being a GLUE Pre-Training Evaluation System, to being a GLUE Pre-Trained Sentence Representation Model Evaluation System.
- It can range from being a GLUE BiLSTM Sentence Encoder to being a Post-Attention BiLSTM System.
- It can use Natural Language Understanding Task Corpora and Datasets, such as:
Example(s):
Counter-Example(s):
See: GLUE Dataset, BERT System, OpenAI GPT System, Transfer Learning System, Natural Language Processing System, Natural Language Inference System, Deep Learning System, Machine Translation System, Artificial Neural Network.
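
Illustration: the three performance measures named in the Context (Accuracy, F1 score, Matthews Correlation Coefficient) can be sketched for a binary-label task as follows. This is a minimal sketch and not the benchmark's official scoring code; the function names and the toy label lists are assumptions made for illustration.

```python
import math


def confusion_counts(gold, pred, positive=1):
    """Count true/false positives and negatives for one positive class."""
    tp = sum(1 for g, p in zip(gold, pred) if g == positive and p == positive)
    tn = sum(1 for g, p in zip(gold, pred) if g != positive and p != positive)
    fp = sum(1 for g, p in zip(gold, pred) if g != positive and p == positive)
    fn = sum(1 for g, p in zip(gold, pred) if g == positive and p != positive)
    return tp, tn, fp, fn


def accuracy(gold, pred):
    """Fraction of predictions that match the gold label."""
    return sum(1 for g, p in zip(gold, pred) if g == p) / len(gold)


def f1_score(gold, pred, positive=1):
    """Harmonic mean of precision and recall for the positive class."""
    tp, _, fp, fn = confusion_counts(gold, pred, positive)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def matthews_corrcoef(gold, pred, positive=1):
    """Matthews correlation coefficient; defined as 0.0 when undefined."""
    tp, tn, fp, fn = confusion_counts(gold, pred, positive)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


# Toy labels (hypothetical, not taken from any GLUE task).
gold = [1, 1, 0, 0, 1, 0]
pred = [1, 0, 0, 0, 1, 1]
scores = {
    "accuracy": accuracy(gold, pred),
    "f1": f1_score(gold, pred),
    "mcc": matthews_corrcoef(gold, pred),
}
```

Unlike accuracy, the Matthews coefficient uses all four confusion-matrix cells, so it stays informative when the two classes are unbalanced.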

References

(HuggingFaceH4, 2023) ⇒ HuggingFaceH4. (2023). "GLUE Dataset". In: Hugging Face.
- QUOTE: The GLUE dataset is a collection of nine natural language understanding tasks, each designed to evaluate specific aspects of language model performance.
  It is widely used for benchmarking and fine-tuning large-scale text models.

The format of the GLUE benchmark is model-agnostic, so any system capable of processing sentence and sentence pairs and producing corresponding predictions is eligible to participate. The benchmark tasks are selected so as to favor models that share information across tasks using parameter sharing or other transfer learning techniques. The ultimate goal of GLUE is to drive research in the development of general and robust natural language understanding systems.
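
Illustration: the model-agnostic format described above can be pictured as a single prediction interface that accepts one sentence or a sentence pair and returns a label. The sketch below is hypothetical; `Predictor`, `run_task`, and `baseline_predict` are illustrative names and are not part of any GLUE tooling.

```python
from typing import Callable, List, Optional, Sequence, Tuple

# One example is a sentence plus an optional second sentence.
Example = Tuple[str, Optional[str]]
# Any callable with this shape counts as a "system" in this sketch.
Predictor = Callable[[str, Optional[str]], str]


def run_task(predict: Predictor, examples: Sequence[Example]) -> List[str]:
    """Apply a predictor to every example and collect its labels."""
    return [predict(sentence1, sentence2) for sentence1, sentence2 in examples]


def baseline_predict(sentence1: str, sentence2: Optional[str] = None) -> str:
    """Trivial stand-in system: one fixed label per input shape."""
    return "entailment" if sentence2 is not None else "positive"


# Hypothetical inputs: one sentence pair and one single sentence.
examples = [
    ("A man is sleeping.", "A person is asleep."),
    ("A thoroughly enjoyable film.", None),
]
predictions = run_task(baseline_predict, examples)
```

Because `run_task` depends only on the callable's signature, a BiLSTM encoder or a pre-trained model could be substituted for the baseline without changing the evaluation loop.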

(Glue Benchmark, 2019) ⇒ https://gluebenchmark.com/ Retrieved:2019-09-14.
- QUOTE: The GLUE benchmark comes with a manually-curated evaluation dataset for fine-grained analysis of system performance on a broad range of linguistic phenomena.
  This dataset evaluates sentence understanding through Natural Language Inference (NLI) problems. The NLI task is well-suited to our purposes because it can encompass a large set of skills involved in language understanding, from resolving syntactic ambiguity to high-level reasoning, while still supporting a straightforward evaluation(...)