Natural Language Generation (NLG) Performance Measure

A Natural Language Generation (NLG) Performance Measure is a linguistic processing performance measure for an NLG task (and NLG system).

Context:
- It can (typically) be referenced by an NLG Performance Evaluation Task.
- It can assess various aspects like Text Fluency, Text Coherence, Factual Accuracy, and Text Relevance (to the input context).
- It can range from being an Intrinsic NLG Performance Measure (evaluating the quality of generated text on its own) to being an Extrinsic NLG Performance Measure (assessing the impact of the generated text in a specific application or task).
- It can range from being an Intrinsic Language Generation Performance Measure to being an Extrinsic Language Generation Performance Measure.
- It can range from being a Written Language Generation Performance Measure (such as text generation performance) to being a Spoken Language Generation Performance Measure.
- It can range from being a Manual Language Generation Performance Measure to being an Automated Language Generation Performance Measure to being a Machine-Learned Language Generation Performance Measure.
- It can range from being an Objective NLG Performance Measure (such as ROUGE) to being a Heuristic NLG Performance Measure (such as text coherence).
- It can range from being an Extrinsic NLG Performance Measure (such as NLG A/B test) to being an Intrinsic NLG Performance Measure (such as ROUGE).
- It can support NLG System Improvement.
- …
Example(s):
- a Text Generation Performance Measure, for text generation.
- a Question-Answering Performance Measure, for question answering.
- an Essay-Writing Performance Measure, for essay writing.
- BLEU, primarily used in machine translation, but also applicable in other NLG tasks.
- ROUGE, commonly used in text summarization to compare generated summaries against reference summaries.
- METEOR, which accounts for synonyms and stemming in evaluation.
- CIDEr, for evaluating image captioning by measuring agreement with a set of human-written reference captions.
- Perplexity, used to evaluate how well a language model predicts a text sequence (lower values indicate a better fit).
- BERTScore, which ...
- ...
Counter-Example(s):
- an NLU Performance Measure, which evaluates comprehension tasks in natural language understanding.
- an Information Retrieval Performance Measure, such as precision and recall in search tasks.
- a Software Generation Performance Measure.
See: Natural Language Generation, Text Summarization, Machine Translation, Language Model, Language Understanding Performance, Controlled-English Generation Task.
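
Two of the automated measures listed under Example(s) can be illustrated with a minimal Python sketch. It is not a reference implementation: `rouge1` is a simplified unigram-overlap score in the spirit of ROUGE-1 (it assumes lowercased whitespace tokenization, with no stemming or stopword handling), and `perplexity` assumes the caller already has per-token probabilities from some language model. Both function names are hypothetical and chosen only for this illustration.

```python
import math
from collections import Counter
from typing import Dict, Sequence


def rouge1(candidate: str, reference: str) -> Dict[str, float]:
    """Unigram-overlap precision, recall, and F1 (a simplified ROUGE-1).

    Assumption: lowercased whitespace tokenization, no stemming.
    """
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    # Counter intersection keeps the minimum count per token (clipped matches).
    overlap = sum((cand & ref).values())
    precision = overlap / sum(cand.values()) if cand else 0.0
    recall = overlap / sum(ref.values()) if ref else 0.0
    f1 = 2 * precision * recall / (precision + recall) if overlap else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


def perplexity(token_probs: Sequence[float]) -> float:
    """Perplexity as exp of the mean negative log-probability per token.

    Assumption: every probability lies in (0, 1].
    """
    if not token_probs:
        raise ValueError("need at least one token probability")
    mean_nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(mean_nll)


if __name__ == "__main__":
    print(rouge1("the cat sat on the mat", "the cat is on the mat"))
    print(perplexity([0.25, 0.25, 0.25, 0.25]))
```

In this sketch, the overlap score is an intrinsic, reference-based measure (it needs a reference text), while perplexity is an intrinsic, reference-free measure (it needs only the model's own token probabilities).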

References

2011

(Crossley & McNamara, 2011) ⇒ Scott A. Crossley, and Danielle S. McNamara. (2011). “Understanding Expert Ratings of Essay Quality: Coh-Metrix Analyses of First and Second Language Writing.” International Journal of Continuing Engineering Education and Life Long Learning, 21(2-3).
- ABSTRACT: This article reviews recent studies in which human judgements of essay quality are assessed using Coh-Metrix, an automated text analysis tool. The goal of these studies is to better understand the relationship between linguistic features of essays and human judgements of writing quality. Coh-Metrix reports on a wide range of linguistic features, affording analyses of writing at various levels of text structure, including surface, text-base, and situation model levels. Recent studies have examined linguistic features of essay quality related to co-reference, connectives, syntactic complexity, lexical diversity, spatiality, temporality, and lexical characteristics. These studies have analysed essays written by both first language and second language writers. The results support the notion that human judgements of essay quality are best predicted by linguistic indices that correlate with measures of language sophistication such as lexical diversity, word frequency, and syntactic complexity. In contrast, human judgements of essay quality are not strongly predicted by linguistic indices related to cohesion. Overall, the studies portray high quality writing as containing more complex language that may not facilitate text comprehension.