From 9e4a61acc1939c374b5942d8e5b1cab8a62a8b1c Mon Sep 17 00:00:00 2001
From: "Christopher L." <157066905+cluebbers@users.noreply.github.com>
Date: Mon, 26 May 2025 09:52:57 +0200
Subject: [PATCH] Add warning about METEOR version and flag-induced score
 variance
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

This pull request adds a call-out block to `evaluate/metrics/meteor/README.md`
(just before the Citation section) to inform users about significant score
discrepancies in METEOR evaluations. Variations of up to ±10 points can occur
due to differences between versions 1.0 and 1.5, as well as the use of
specific flags (`-l`, `-norm`, `-vOut`). By highlighting this issue, users are
encouraged to specify the Java package version and document the flags used to
ensure reproducibility. The information is based on findings from Lübbers
(2024), available at https://github.com/cluebbers/Reproducibility-METEOR-NLP.
---
 metrics/meteor/README.md | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/metrics/meteor/README.md b/metrics/meteor/README.md
index 0b234f70f..4d4f21c74 100644
--- a/metrics/meteor/README.md
+++ b/metrics/meteor/README.md
@@ -116,6 +116,9 @@ While the correlation between METEOR and human judgments was measured for Chines
 Furthermore, while the alignment and matching done in METEOR is based on unigrams, using multiple word entities (e.g. bigrams) could contribute to improving its accuracy -- this has been proposed in [more recent publications](https://www.cs.cmu.edu/~alavie/METEOR/pdf/meteor-naacl-2010.pdf) on the subject.
 
+**Note:** METEOR scores can differ by up to ±10 points between METEOR v1.0 and v1.5 and depending on the flags used (`-l`, `-norm`, `-vOut`).
+To ensure reproducibility, pin the Java package version and document your flags. This module uses the NLTK implementation, which corresponds to METEOR v1.0.
+See [Lübbers, 2024](https://github.com/cluebbers/Reproducibility-METEOR-NLP) for details.
 ## Citation
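
For the kind of reproducibility record the added note recommends, it can help to log the backend versions alongside the score. A minimal sketch, assuming the Hugging Face `evaluate` package with its NLTK-backed `meteor` module is installed; the example sentences are purely illustrative:

```python
# Minimal sketch: compute METEOR via the NLTK-backed `evaluate` module and
# record the library versions so the score can be reproduced later.
import evaluate
import nltk

meteor = evaluate.load("meteor")  # NLTK implementation (METEOR v1.0 semantics)

predictions = ["the cat sat on the mat"]
references = ["a cat was sitting on the mat"]

results = meteor.compute(predictions=predictions, references=references)
print(f"METEOR: {results['meteor']:.4f}")
print(f"evaluate=={evaluate.__version__}, nltk=={nltk.__version__}")
```

Scores obtained this way are only comparable to Java METEOR 1.5 results if the version and flag differences flagged in the note are accounted for.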