Meteor (Metric for Evaluation of Translation with Explicit ORdering) Score
A Meteor (Metric for Evaluation of Translation with Explicit ORdering) Score is an NLP task performance measure that is based on the harmonic mean of unigram precision and recall.
- Context:
- It can have been developed by (Denkowski & Lavie, 2014).
- It can be the output of a Meteor Universal Scoring Task.
- It can be a Machine Translation Performance Measure, via alignment of hypothesis-reference translation pairs.
- It can be a Text Summarization Performance Measure, by ...
- It can be defined as: $M = F_{mean}(1-p)$, where (see the worked sketch after this list):
- $p = 0.5 \left(\frac{c}{u_{m}}\right)^3$ is a text segment's fragmentation penalty, where $c$ is the number of chunks and $u_m$ is the number of mapped unigrams;
- $F_{mean} = \frac{10PR}{R+9P}$ is the recall-weighted harmonic mean of the unigram precision $P = \frac{m}{w_{t}}$ and recall $R = \frac{m}{w_{r}}$, where $m$ is the number of unigrams found in both the candidate and reference translations, $w_t$ is the number of unigrams in the candidate translation, and $w_r$ is the number of unigrams in the reference translation.
- …
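The definition above can be made concrete with a small worked sketch. The function name and the example sentences below are illustrative assumptions, and the sketch presumes that unigram matching and chunking have already been performed by an aligner:

```python
def meteor_fragment_score(m, w_t, w_r, c):
    """Combine unigram match statistics into a Meteor-style segment score.

    m   -- number of matched unigrams (found in both hypothesis and reference); here u_m = m
    w_t -- number of unigrams in the hypothesis (candidate) translation
    w_r -- number of unigrams in the reference translation
    c   -- number of chunks (contiguous runs of matched, correctly ordered unigrams)
    """
    if m == 0:
        return 0.0
    precision = m / w_t                                          # P = m / w_t
    recall = m / w_r                                             # R = m / w_r
    f_mean = 10 * precision * recall / (recall + 9 * precision)  # recall-weighted harmonic mean
    penalty = 0.5 * (c / m) ** 3                                 # fragmentation penalty
    return f_mean * (1 - penalty)

# Worked example: hypothesis "the cat sat on the mat" vs. reference "the cat was on the mat":
# 5 matched unigrams out of 6 in each sentence, grouped into 2 chunks ("the cat", "on the mat").
print(round(meteor_fragment_score(m=5, w_t=6, w_r=6, c=2), 4))   # ~0.8067
```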
- Example(s):
- Meteor score distributions over individual segments for two MT systems:
- …
- Counter-Example(s):
- See: 2014 ACL Workshop on Statistical Machine Translation, Statistical Machine Translation, Performance Metric, Similarity Score, WordNet Database, Sockeye Neural Machine Translation Toolkit, Word Error Rate (WER).
References
2020a
- (Wikipedia, 2020) ⇒ https://en.wikipedia.org/wiki/METEOR Retrieved:2020-11-22.
- METEOR (Metric for Evaluation of Translation with Explicit ORdering) is a metric for the evaluation of machine translation output. The metric is based on the harmonic mean of unigram precision and recall, with recall weighted higher than precision. It also has several features that are not found in other metrics, such as stemming and synonymy matching, along with the standard exact word matching. The metric was designed to fix some of the problems found in the more popular BLEU metric, and also produce good correlation with human judgement at the sentence or segment level. This differs from the BLEU metric in that BLEU seeks correlation at the corpus level.
Results have been presented which give correlation of up to 0.964 with human judgement at the corpus level, compared to BLEU's achievement of 0.817 on the same data set. At the sentence level, the maximum correlation with human judgement achieved was 0.403.
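For quick experimentation, the NLTK library ships an implementation of the metric. A minimal usage sketch, assuming a recent NLTK release (which expects pre-tokenized input) and the WordNet data it relies on:

```python
# Requires: pip install nltk, plus the WordNet corpora downloaded below.
import nltk
from nltk.translate.meteor_score import meteor_score

nltk.download("wordnet", quiet=True)
nltk.download("omw-1.4", quiet=True)

reference = "the cat was on the mat".split()
hypothesis = "the cat sat on the mat".split()

# meteor_score takes a list of tokenized references and one tokenized hypothesis.
score = meteor_score([reference], hypothesis)
print(f"METEOR: {score:.4f}")
```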
2020b
- (Denkowski & Lavie, 2020) ⇒ Meteor Website: http://www.cs.cmu.edu/~alavie/METEOR/ Retrieved:2020-11-22.
- QUOTE: The Meteor automatic evaluation metric scores machine translation hypotheses by aligning them to one or more reference translations. Alignments are based on exact, stem, synonym, and paraphrase matches between words and phrases. Segment and system level metric scores are calculated based on the alignments between hypothesis-reference pairs. The metric includes several free parameters that are tuned to emulate various human judgment tasks including WMT ranking and NIST adequacy. The current version also includes a tuning configuration for use with MERT and MIRA. Meteor has extended support (paraphrase matching and tuned parameters) for the following languages: English, Czech, German, French, Spanish, and Arabic. Meteor is implemented in pure Java and requires no installation or dependencies to score MT output. On average, hypotheses are scored at a rate of 500 segments per second per CPU core. Meteor consistently demonstrates high correlation with human judgments in independent evaluations such as EMNLP WMT 2011 and NIST Metrics MATR 2010.
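A typical command-line invocation of the Meteor jar looks roughly like the sketch below; the jar file name and input file names are placeholders, and the exact options should be checked against the README bundled with the release:

```
# Score hypotheses.txt against references.txt for English, normalizing punctuation.
java -Xmx2G -jar meteor-1.5.jar hypotheses.txt references.txt -l en -norm
```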
2014
- (Denkowski & Lavie, 2014) ⇒ Michael J. Denkowski, and Alon Lavie. (2014). “Meteor Universal: Language Specific Translation Evaluation for Any Target Language". In: Proceedings of the Ninth Workshop on Statistical Machine Translation (WMT@ACL 2014). DOI:10.3115/v1/W14-3348.
- QUOTE: Meteor evaluates translation hypotheses by aligning them to reference translations and calculating sentence-level similarity scores. For a hypothesis-reference pair, the space of possible alignments is constructed by exhaustively identifying all possible matches between the sentences according to the following matchers:
- Exact: Match words if their surface forms are identical.
- Stem: Stem words using a language appropriate Snowball Stemmer (...) and match if the stems are identical.
- Synonym: Match words if they share membership in any synonym set according to the WordNet database (...).
- Paraphrase: Match phrases if they are listed as paraphrases in a language appropriate paraphrase table (...).
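A rough sketch of the first three matchers quoted above (exact, stem, synonym), using NLTK's Snowball stemmer and WordNet. The helper function is a simplified word-level illustration, not the Meteor implementation; it omits the paraphrase table and the alignment search:

```python
import nltk
from nltk.stem.snowball import SnowballStemmer
from nltk.corpus import wordnet

nltk.download("wordnet", quiet=True)

stemmer = SnowballStemmer("english")

def match_type(hyp_word, ref_word):
    """Return the first matcher (exact, stem, synonym) that links two words, or None."""
    if hyp_word == ref_word:                                  # Exact: identical surface forms
        return "exact"
    if stemmer.stem(hyp_word) == stemmer.stem(ref_word):      # Stem: identical Snowball stems
        return "stem"
    if set(wordnet.synsets(hyp_word)) & set(wordnet.synsets(ref_word)):
        return "synonym"                                      # Synonym: shared WordNet synset
    return None  # a full implementation would also consult a paraphrase table

print(match_type("cat", "cat"))    # exact
print(match_type("cats", "cat"))   # stem
print(match_type("big", "large"))  # synonym (both belong to the same WordNet synset)
```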