MAUVE Score
A MAUVE Score is a text distribution similarity score produced by the MAUVE metric.
- Context:
- It can measure the similarity between Machine-Generated and human-written text distributions.
- It can quantify how closely the generated text mimics the characteristics of human text by computing the divergence between the two distributions.
- It can employ a method that involves sampling text from both sources, embedding these samples using a language model (e.g., GPT-2), and then quantizing these embeddings into a lower-dimensional, discrete space to facilitate comparison (see the sketch below this outline).
- It can be particularly relevant for evaluating open-ended text generation models, such as those used in chatbots, story generation, and automatic content creation.
- It can assist in benchmarking and improving the performance of generative models in NLP by providing a quantitative measure of the gap between neural text and human text.
- It can correlate with human judgments, making it a valuable metric for assessing the quality of generated text in terms of fluency, coherence, and relevance.
- ...
- Example(s):
- ...
- Counter-Example(s):
- ROUGE Score, which focuses on the overlap between system-generated summaries and reference summaries.
- BLEU Score, primarily used in machine translation evaluation to measure precision of n-grams between the generated text and reference translations.
- METEOR Score, another metric used for machine translation but with adjustments for synonymy and stemming.
- See: MAUVE Algorithm, Generative Models, Evaluation Metrics in NLP, Text Generation, Open-Ended Text Generation.
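The final step of the pipeline described above can be illustrated with a minimal sketch that is not the reference implementation: it assumes the human and model texts have already been embedded and quantized into cluster histograms `p` and `q`, and it uses the scaling constant c = 5 that the MAUVE paper adopts as its default. The helper names are hypothetical.

```python
import numpy as np

def kl_divergence(p, q):
    """KL(p || q) for discrete histograms; assumes q > 0 wherever p > 0."""
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

def mauve_from_histograms(p, q, c=5.0, num_points=25):
    """Illustrative MAUVE-style score from two cluster histograms.

    p, q : normalized histograms over the same quantization clusters
           (one for human text, one for machine-generated text)
    c    : scaling constant (the MAUVE paper uses c = 5 by default)
    """
    # Trace the divergence curve from the q-end (lambda -> 0) to the p-end (lambda -> 1).
    xs, ys = [1.0], [0.0]  # limit point where the mixture equals q
    for lam in np.linspace(1e-6, 1 - 1e-6, num_points):
        r = lam * p + (1 - lam) * q                  # mixture distribution
        xs.append(np.exp(-c * kl_divergence(q, r)))  # how well the mixture covers q
        ys.append(np.exp(-c * kl_divergence(p, r)))  # how well the mixture covers p
    xs.append(0.0)
    ys.append(1.0)  # limit point where the mixture equals p
    # x decreases monotonically along the curve, so reverse it and integrate.
    return float(np.trapz(ys[::-1], xs[::-1]))

# Similar histograms score close to 1; very different ones score close to 0.
human = np.array([0.25, 0.25, 0.25, 0.25])
model = np.array([0.40, 0.30, 0.20, 0.10])
print(mauve_from_histograms(human, model))
```

The score is the area under this divergence curve, so it penalizes both mass the model places where humans do not write (model errors) and human-like text the model fails to cover (lack of diversity).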
References
2024
- GPT-4
- MAUVE Scores are utilized in Natural Language Processing (NLP) to measure the similarity between distributions of machine-generated text and human-written text. This metric is particularly relevant for evaluating open-ended text generation models by quantifying how closely the generated text mimics the characteristics of human text. MAUVE achieves this by computing the divergence between the two distributions, employing a method that involves sampling text from both sources, embedding these samples using a language model (e.g., GPT-2), and then quantizing these embeddings into a lower-dimensional, discrete space to facilitate comparison.
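The pipeline described in this passage is distributed by the authors as the `mauve-text` package on PyPI, which handles the GPT-2 featurization and k-means quantization internally. The following is a hedged usage sketch: the file names are placeholders, and the keyword arguments reflect the library's documented interface at the time of writing and may differ across versions.

```python
# pip install mauve-text   (the authors' reference implementation)
import mauve

# In practice each list should hold on the order of thousands of samples;
# very small sample sets are too few for a stable estimate.
p_text = open("human_samples.txt").read().splitlines()   # human-written text (placeholder file)
q_text = open("model_samples.txt").read().splitlines()   # machine-generated text (placeholder file)

# compute_mauve embeds both sample sets with a GPT-2 featurizer, quantizes the
# embeddings with k-means, and returns a result whose .mauve field is the score.
out = mauve.compute_mauve(p_text=p_text, q_text=q_text,
                          max_text_length=256, verbose=False)
print(out.mauve)   # closer to 1.0 means the model's text distribution is closer to human text
```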
2023
- (Pillutla et al., 2023) ⇒ Krishna Pillutla, Lang Liu, John Thickstun, Sean Welleck, Swabha Swayamdipta, Rowan Zellers, Sewoong Oh, Yejin Choi, and Zaid Harchaoui. (2023). “MAUVE Scores for Generative Models: Theory and Practice.” In: Journal of Machine Learning Research, 24(356).
- NOTE:
- It introduces MAUVE Scores for evaluating Generative Models against target distributions in Text Generation and Image Generation.
- It demonstrates MAUVE Scores' correlation with Human Judgments and their ability to quantify known properties of generated texts and images.
- It compares MAUVE Scores across various f-Divergences, showing flexibility and effectiveness in Generative Model Evaluation.
- It extends the application of MAUVE Scores beyond text to Image Generation, showing it can recover expected trends and correlate with established Evaluation Metrics.
2021
- (Pillutla et al., 2021) ⇒ Krishna Pillutla, Swabha Swayamdipta, Rowan Zellers, John Thickstun, Sean Welleck, Yejin Choi, and Zaid Harchaoui. (2021). “MAUVE: Measuring the Gap Between Neural Text and Human Text Using Divergence Frontiers.” Advances in Neural Information Processing Systems, 34.