Nikolay Donets

GEMBA for Machine Translation tasks


GEMBA (GPT Estimation Metric-Based Assessment) is an approach to evaluating translation quality with large language models. The findings behind it are promising, not just for translation assessment but for other kinds of evaluation as well.

Introduction

The study behind GEMBA primarily investigates whether LLMs can effectively assess translation quality. The inquiry, however, could be extended beyond translation, offering insights into the design of other assessments built with generative AI.

GEMBA operates both with and without a reference translation. By employing zero-shot prompting, the study compares four distinct prompt variants across two modes, depending on whether a reference translation is available. This method has been compared to the results from the WMT22 Metrics shared task, showcasing GEMBA’s ability to achieve state-of-the-art accuracy in assessing translations from English into German, English into Russian, and Chinese into English.

To use GEMBA for assessing translations, certain parameters are required:

  • Source Language.
  • Target Language.
  • Source Segments.
  • Candidate Translations.
  • Optional Reference Translations.
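
These inputs can be grouped into a small container before a prompt is built. A minimal sketch, assuming Python; the class and field names are illustrative, not from the paper or any official GEMBA implementation:

from dataclasses import dataclass
from typing import List, Optional

@dataclass
class GembaRequest:
    # Illustrative container for the inputs listed above.
    source_lang: str                             # e.g. "English"
    target_lang: str                             # e.g. "German"
    source_segs: List[str]                       # source segments
    target_segs: List[str]                       # candidate translations, aligned with source_segs
    reference_segs: Optional[List[str]] = None   # human references, when available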

GEMBA can be used for different assessment needs:

  • For scoring tasks: GEMBA-DA and GEMBA-SQM
  • For classification tasks: GEMBA-stars and GEMBA-classes

Scoring Tasks

GEMBA-DA: Direct Assessment

Output scores range from 0 to 100.

Accuracy with GPT-4:

  • With human references: 89.8%
  • Without human references: 87.6%

Prompt for assessment with human references:

Score the following translation from {source_lang} to {target_lang}
with respect to the human reference on a continuous scale from 0 to 100,
where a score of zero means "no meaning preserved" and score of one
hundred means "perfect meaning and grammar".
{source_lang} source: "{source_seg}"
{target_lang} human reference: {reference_seg}
{target_lang} translation: "{target_seg}"
Score:

Prompt for assessment without human references:

Score the following translation from {source_lang} to {target_lang}
on a continuous scale from 0 to 100, where a score of zero means
"no meaning preserved" and score of one hundred means
"perfect meaning and grammar".
{source_lang} source: "{source_seg}"
{target_lang} translation: "{target_seg}"
Score:
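
A rough sketch of how the reference-based GEMBA-DA prompt above could be filled in and sent to a chat model via the OpenAI Python SDK. The template text comes from the prompt above; the model name, temperature, and helper name are assumptions, not the authors' implementation:

from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

GEMBA_DA_REF_TEMPLATE = (
    'Score the following translation from {source_lang} to {target_lang}\n'
    'with respect to the human reference on a continuous scale from 0 to 100,\n'
    'where a score of zero means "no meaning preserved" and score of one\n'
    'hundred means "perfect meaning and grammar".\n'
    '{source_lang} source: "{source_seg}"\n'
    '{target_lang} human reference: {reference_seg}\n'
    '{target_lang} translation: "{target_seg}"\n'
    'Score:'
)

def gemba_da(source_lang, target_lang, source_seg, target_seg, reference_seg,
             model="gpt-4"):
    # Fill the template and return the raw model reply; score parsing is a separate step.
    prompt = GEMBA_DA_REF_TEMPLATE.format(
        source_lang=source_lang,
        target_lang=target_lang,
        source_seg=source_seg,
        target_seg=target_seg,
        reference_seg=reference_seg,
    )
    response = client.chat.completions.create(
        model=model,
        temperature=0,  # deterministic scoring
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content.strip()

The reply is free text, so it still has to be converted into a number before it can be aggregated.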

GEMBA-SQM: Scalar Quality Metrics

Output scores range from 0 to 100.

Accuracy with GPT-4:

  • With human references: 88.7%
  • Without human references: 89.1%

Prompt for assessment with human references:

Score the following translation from {source_lang} to {target_lang}
with respect to the human reference on a continuous scale from 0 to 100
that starts with "No meaning preserved", goes through "Some meaning preserved",
then "Most meaning preserved and few grammar mistakes", up to "Perfect meaning
and grammar".
{source_lang} source: "{source_seg}"
{target_lang} human reference: "{reference_seg}"
{target_lang} translation: "{target_seg}"
Score (0-100):

Prompt for assessment without human references:

Score the following translation from {source_lang} to {target_lang} on
a continuous scale from 0 to 100 that starts with "No meaning preserved",
goes through "Some meaning preserved", then "Most meaning preserved and
few grammar mistakes", up to "Perfect meaning and grammar".
{source_lang} source: "{source_seg}"
{target_lang} translation: "{target_seg}"
Score (0-100):
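
Because the model answers in free text, the numeric score still has to be extracted from the reply. A minimal parsing helper under that assumption (illustrative, not part of GEMBA itself):

import re
from typing import Optional

def parse_score(reply: str) -> Optional[float]:
    # Pull the first number out of the reply and keep it only if it falls in the 0-100 range.
    match = re.search(r"\d+(?:\.\d+)?", reply)
    if match is None:
        return None
    score = float(match.group())
    return score if 0.0 <= score <= 100.0 else None

# parse_score("87")                       -> 87.0
# parse_score("Score: 73, mostly fluent") -> 73.0
# parse_score("no numeric answer")        -> None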

Classification Tasks

GEMBA-Stars

Output scores range from 1 to 5. Special care is taken to handle non-numerical answers such as “Three stars”, “****”, or “1 star” (a parsing sketch follows the prompts below).

Accuracy with GPT-4:

  • With human references: 91.2%
  • Without human references: 89.1%

Prompt for assessment with human references:

Score the following translation from {source_lang} to {target_lang} with
respect to the human reference with one to five stars. Where one star
means "Nonsense/No meaning preserved", two stars mean "Some meaning
preserved, but not understandable", three stars mean "Some meaning
preserved and understandable", four stars mean "Most meaning preserved
with possibly few grammar mistakes", and five stars mean "Perfect meaning
and grammar".
{source_lang} source: "{source_seg}"
{target_lang} human reference: "{reference_seg}"
{target_lang} translation: "{target_seg}"
Stars:

Prompt for assessment without human references:

Score the following translation from {source_lang} to {target_lang} with
one to five stars. Where one star means "Nonsense/No meaning preserved",
two stars mean "Some meaning preserved, but not understandable", three stars
mean "Some meaning preserved and understandable", four stars mean "Most
meaning preserved with possibly few grammar mistakes", and five stars mean
"Perfect meaning and grammar".
{source_lang} source: "{source_seg}"
{target_lang} translation: "{target_seg}"
Stars:
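
A small sketch of the kind of normalisation hinted at above, mapping replies such as “Three stars”, “****”, or “1 star” to an integer. The exact rules are assumptions, not the authors' code:

import re

# Spelled-out star counts (assumed mapping).
WORD_TO_STARS = {"one": 1, "two": 2, "three": 3, "four": 4, "five": 5}

def parse_stars(reply: str):
    # Normalise a model reply to an integer in 1-5, or None if nothing matches.
    text = reply.strip().lower()
    digit = re.search(r"[1-5]", text)           # e.g. "4" or "1 star"
    if digit:
        return int(digit.group())
    for word, value in WORD_TO_STARS.items():   # e.g. "three stars"
        if word in text:
            return value
    stars = text.count("*")                     # e.g. "****"
    if 1 <= stars <= 5:
        return stars
    return None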

GEMBA-Classes

The output is one of the labels “No meaning preserved”, “Some meaning preserved, but not understandable”, “Some meaning preserved and understandable”, “Most meaning preserved, minor issues”, or “Perfect translation”.

Accuracy with GPT-4:

  • With human references: 89.1%
  • Without human references: 91.2%

Prompt for assessment with human references:

Classify the quality of translation from {source_lang} to {target_lang} with
respect to the human reference into one of following classes: "No meaning
preserved", "Some meaning preserved, but not understandable", "Some meaning
preserved and understandable", "Most meaning preserved, minor issues", "Perfect
translation".
{source_lang} source: "{source_seg}"
{target_lang} human reference: "{reference_seg}"
{target_lang} translation: "{target_seg}"
Class:

Prompt for assessment without human references:

Classify the quality of translation from {source_lang} to {target_lang} into one
of following classes: "No meaning preserved", "Some meaning preserved, but not
understandable", "Some meaning preserved and understandable", "Most meaning
preserved, minor issues", "Perfect translation".
{source_lang} source: "{source_seg}"
{target_lang} translation: "{target_seg}"
Class:
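
For this variant the reply can simply be checked against the five allowed labels. A minimal sketch; the helper is hypothetical:

GEMBA_CLASSES = [
    "No meaning preserved",
    "Some meaning preserved, but not understandable",
    "Some meaning preserved and understandable",
    "Most meaning preserved, minor issues",
    "Perfect translation",
]

def parse_class(reply: str):
    # Return the allowed label contained in the reply, or None if no label matches.
    text = reply.strip().lower()
    for label in GEMBA_CLASSES:
        if label.lower() in text:
            return label
    return None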

Conclusion

Protocol             | Task type      | Accuracy, %
GEMBA-DA             | Scoring        | 89.8
GEMBA-DA[noref]      | Scoring        | 87.6
GEMBA-SQM            | Scoring        | 88.7
GEMBA-SQM[noref]     | Scoring        | 89.1
GEMBA-Stars          | Classification | 91.2
GEMBA-Stars[noref]   | Classification | 89.1
GEMBA-Classes        | Classification | 89.1
GEMBA-Classes[noref] | Classification | 91.2

GEMBA stands as a testament to the growing capabilities of LLMs in practical applications. By offering an adaptable method for translation quality assessment, it opens new avenues for language service applications.

Let’s keep an eye on WMT23 outcomes to discover the full potential of modern LLMs across various tasks.

References
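
Kocmi, T., & Federmann, C. (2023). Large Language Models Are State-of-the-Art Evaluators of Translation Quality. arXiv:2302.14520.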