Meteor (Metric for Evaluation of Translation with Explicit ORdering) Score
A Meteor (Metric for Evaluation of Translation with Explicit ORdering) Score is an NLP task performance measure that is based on the harmonic mean of unigram precision and recall.
- Context:
- It can have been developed by (Denkowski & Lavie, 2014).
- It can be the output of a Meteor Universal Scoring Task.
- It can be a Machine Translation Performance Measure, via alignment of hypothesis-reference translation pairs.
- It can be a Text Summarization Performance Measure, by ...
- It can be defined as: $M = F_{mean}(1-p)$, where (see the worked sketch after this list):
- $p = 0.5 \left(\frac{c}{u_{m}}\right)^3$ is a text segment's fragmentation penalty, where $c$ is the number of chunks and $u_m$ is the number of mapped unigrams;
- $F_{mean} = \frac{10PR}{R+9P}$ is the recall-weighted harmonic mean of the unigram precision $P = \frac{m}{w_{t}}$ and recall $R = \frac{m}{w_{r}}$, where $m$ is the number of unigrams found in both the candidate and reference translations, $w_t$ is the number of unigrams in the candidate translation, and $w_r$ is the number of unigrams in the reference translation.
- …
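The definition above can be made concrete with a small worked sketch. The function name and the example sentences below are illustrative assumptions, and the sketch presumes that unigram matching and chunking have already been performed by an aligner:

```python
def meteor_fragment_score(m, w_t, w_r, c):
    """Combine unigram match statistics into a Meteor-style segment score.

    m   -- number of matched unigrams (found in both hypothesis and reference); here u_m = m
    w_t -- number of unigrams in the hypothesis (candidate) translation
    w_r -- number of unigrams in the reference translation
    c   -- number of chunks (contiguous runs of matched, correctly ordered unigrams)
    """
    if m == 0:
        return 0.0
    precision = m / w_t                                          # P = m / w_t
    recall = m / w_r                                             # R = m / w_r
    f_mean = 10 * precision * recall / (recall + 9 * precision)  # recall-weighted harmonic mean
    penalty = 0.5 * (c / m) ** 3                                 # fragmentation penalty
    return f_mean * (1 - penalty)

# Worked example: hypothesis "the cat sat on the mat" vs. reference "the cat was on the mat":
# 5 matched unigrams out of 6 in each sentence, grouped into 2 chunks ("the cat", "on the mat").
print(round(meteor_fragment_score(m=5, w_t=6, w_r=6, c=2), 4))   # ~0.8067
```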
- Example(s):
- Meteor score distributions over individual segments for two MT systems:
- …
- Counter-Example(s):
- See: 2014 ACL Workshop on Statistical Machine Translation, Statistical Machine Translation, Performance Metric, Similarity Score, WordNet Database, Sockeye Neural Machine Translation Toolkit, Word Error Rate (WER).
References
2020a
- (Wikipedia, 2020) ⇒ https://en.wikipedia.org/wiki/METEOR Retrieved:2020-11-22.
- METEOR (Metric for Evaluation of Translation with Explicit ORdering) is a metric for the evaluation of machine translation output. The metric is based on the harmonic mean of unigram precision and recall, with recall weighted higher than precision. It also has several features that are not found in other metrics, such as stemming and synonymy matching, along with the standard exact word matching. The metric was designed to fix some of the problems found in the more popular BLEU metric, and also produce good correlation with human judgement at the sentence or segment level. This differs from the BLEU metric in that BLEU seeks correlation at the corpus level.
Results have been presented which give correlation of up to 0.964 with human judgement at the corpus level, compared to BLEU's achievement of 0.817 on the same data set. At the sentence level, the maximum correlation with human judgement achieved was 0.403.
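For quick experimentation, the NLTK library ships an implementation of the metric. A minimal usage sketch, assuming a recent NLTK release (which expects pre-tokenized input) and the WordNet data it relies on:

```python
# Requires: pip install nltk, plus the WordNet corpora downloaded below.
import nltk
from nltk.translate.meteor_score import meteor_score

nltk.download("wordnet", quiet=True)
nltk.download("omw-1.4", quiet=True)

reference = "the cat was on the mat".split()
hypothesis = "the cat sat on the mat".split()

# meteor_score takes a list of tokenized references and one tokenized hypothesis.
score = meteor_score([reference], hypothesis)
print(f"METEOR: {score:.4f}")
```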
2020b
- (Denkowski & Lavie, 2020) ⇒ Meteor Website: http://www.cs.cmu.edu/~alavie/METEOR/ Retrieved:2020-11-22.
- QUOTE: The Meteor automatic evaluation metric scores machine translation hypotheses by aligning them to one or more reference translations. Alignments are based on exact, stem, synonym, and paraphrase matches between words and phrases. Segment and system level metric scores are calculated based on the alignments between hypothesis-reference pairs. The metric includes several free parameters that are tuned to emulate various human judgment tasks including WMT ranking and NIST adequacy. The current version also includes a tuning configuration for use with MERT and MIRA. Meteor has extended support (paraphrase matching and tuned parameters) for the following languages: English, Czech, German, French, Spanish, and Arabic. Meteor is implemented in pure Java and requires no installation or dependencies to score MT output. On average, hypotheses are scored at a rate of 500 segments per second per CPU core. Meteor consistently demonstrates high correlation with human judgments in independent evaluations such as EMNLP WMT 2011 and NIST Metrics MATR 2010.
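A typical command-line invocation of the Meteor jar looks roughly like the sketch below; the jar file name and input file names are placeholders, and the exact options should be checked against the README bundled with the release:

```
# Score hypotheses.txt against references.txt for English, normalizing punctuation.
java -Xmx2G -jar meteor-1.5.jar hypotheses.txt references.txt -l en -norm
```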
2014
- (Denkowski & Lavie, 2014) ⇒ Michael J. Denkowski, and Alon Lavie. (2014). “Meteor Universal: Language Specific Translation Evaluation for Any Target Language". In: Proceedings of the Ninth Workshop on Statistical Machine Translation (WMT@ACL 2014). DOI:10.3115/v1/W14-3348.
- QUOTE: Meteor evaluates translation hypotheses by aligning them to reference translations and calculating sentence-level similarity scores. For a hypothesis-reference pair, the space of possible alignments is constructed by exhaustively identifying all possible matches between the sentences according to the following matchers:
- Exact: Match words if their surface forms are identical.
- Stem: Stem words using a language appropriate Snowball Stemmer (...) and match if the stems are identical.
- Synonym: Match words if they share membership in any synonym set according to the WordNet database (...).
- Paraphrase: Match phrases if they are listed as paraphrases in a language appropriate paraphrase table (...).
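A rough sketch of the first three matchers quoted above (exact, stem, synonym), using NLTK's Snowball stemmer and WordNet. The helper function is a simplified word-level illustration, not the Meteor implementation; it omits the paraphrase table and the alignment search:

```python
import nltk
from nltk.stem.snowball import SnowballStemmer
from nltk.corpus import wordnet

nltk.download("wordnet", quiet=True)

stemmer = SnowballStemmer("english")

def match_type(hyp_word, ref_word):
    """Return the first matcher (exact, stem, synonym) that links two words, or None."""
    if hyp_word == ref_word:                                  # Exact: identical surface forms
        return "exact"
    if stemmer.stem(hyp_word) == stemmer.stem(ref_word):      # Stem: identical Snowball stems
        return "stem"
    if set(wordnet.synsets(hyp_word)) & set(wordnet.synsets(ref_word)):
        return "synonym"                                      # Synonym: shared WordNet synset
    return None  # a full implementation would also consult a paraphrase table

print(match_type("cat", "cat"))    # exact
print(match_type("cats", "cat"))   # stem
print(match_type("big", "large"))  # synonym (both belong to the same WordNet synset)
```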