AI GLOSSARY

BLEU Score

BLEU · Evaluation & Performance

A metric for evaluating the quality of machine-generated text, particularly translations, by comparing it against one or more human-written reference texts. It measures how many short sequences of words (n-grams) in the generated text also appear in the references, combined with a brevity penalty that discourages overly short output. BLEU is widely used but has real limitations: it does not capture meaning or fluency well, and high scores do not always correspond to high human-judged quality.
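To make the mechanics concrete, here is a minimal sketch of sentence-level BLEU against a single reference, written from the standard definition (clipped n-gram precisions, geometric mean, brevity penalty). Production use would rely on a library such as sacrebleu rather than this illustration.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    # All contiguous n-grams of the token list, with counts.
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(candidate, reference, max_n=4):
    # Geometric mean of modified n-gram precisions (n = 1..max_n),
    # scaled by a brevity penalty.
    cand, ref = candidate.split(), reference.split()
    precisions = []
    for n in range(1, max_n + 1):
        cand_counts = ngrams(cand, n)
        ref_counts = ngrams(ref, n)
        # "Clipped" matches: a candidate n-gram is credited at most as
        # many times as it occurs in the reference.
        overlap = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
        total = max(sum(cand_counts.values()), 1)
        precisions.append(overlap / total)
    if min(precisions) == 0:
        return 0.0  # any zero precision drives the geometric mean to 0
    geo_mean = math.exp(sum(math.log(p) for p in precisions) / max_n)
    # Brevity penalty: penalize candidates shorter than the reference.
    bp = 1.0 if len(cand) > len(ref) else math.exp(1 - len(ref) / max(len(cand), 1))
    return bp * geo_mean
```

A perfect match scores 1.0; a candidate sharing no words with the reference scores 0.0, which illustrates why BLEU is usually reported at the corpus level with smoothing rather than per sentence.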
See also: benchmark, Evaluation.