Should I Share this Translation?
Evaluating Quality Feedback for User Reliance on Machine Translation
Dayeon Ki1, Kevin Duh2, Marine Carpuat1
1University of Maryland, 2Johns Hopkins University
This repository contains the code and dataset for our EMNLP 2025 Main paper
Should I Share this Translation? Evaluating Quality Feedback for User Reliance on Machine Translation.
In our human study, each English-speaking monolingual participant reviews a sequence of 20 decision-making examples. Each example is shown in a two-step process:
1. Independent decision-making: Participants first make judgments based solely on the English source and its Spanish MT output.
2. AI-assisted decision-making: They then reassess the same example with one of four randomly assigned feedback types.

At each step, they answer two questions:
- Shareability: To the best of your knowledge, is the Spanish translation good enough to safely share with your Spanish-speaking neighbor?
- Confidence: How confident are you in your assessment?

The following figure illustrates our human study setup.
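For illustration, one logged example from the two-step study might look like the record below (all field names are hypothetical, not the repository's actual schema):

```python
# One participant response for a single example (hypothetical field names).
response = {
    "participant_id": "P01",
    "example_id": 7,
    "condition": "backtranslation",  # one of the four feedback types
    "independent": {                 # step 1: English source + Spanish MT only
        "shareable": True,           # shareability judgment
        "confidence": 4,             # e.g. on a Likert scale
    },
    "ai_assisted": {                 # step 2: after viewing quality feedback
        "shareable": False,
        "confidence": 5,
    },
}
```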
2025-08-20: Our paper is accepted to EMNLP 2025!
As people increasingly use AI systems in work and daily life, mechanisms that help them use AI responsibly are urgently needed, especially when they are not equipped to verify AI predictions themselves. We study a realistic Machine Translation (MT) scenario where monolingual users decide whether to share an MT output, first without and then with quality feedback.
We compare four types of quality feedback: explicit feedback, which directly gives users an assessment of translation quality via (1) error highlights and (2) LLM explanations, and implicit feedback, which helps users compare MT inputs and outputs through (3) backtranslation and (4) question–answer (QA) tables.
We explore four types of quality feedback in our human study. The detailed process used to generate each type of feedback is outlined in the paper.
We provide the codebase for building our custom task interface in `interface/`. The code builds on the interface code from the EMNLP 2023 paper Explaining with Contrastive Phrasal Highlighting: A Case Study in Assisting Humans to Detect Translation Differences.
Go through the following steps to run the interface:
- `rm -rf tracker`: Remove the tracker directory, whose files track each condition to ensure participants are randomly assigned across the four conditions.
- `mkdir tracker`: Create a fresh tracker directory.
- `python create_tracker.py`: Create a tracker for each condition.
- `python -u app.py > app.log`: Run `app.py` and log results to `app.log`.
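As a rough sketch of what the tracker does, the balanced random assignment could be implemented as below. This is a minimal illustration assuming one JSON counter file per condition; the file layout and names are hypothetical, not the actual `create_tracker.py` implementation.

```python
import json
import random
from pathlib import Path

# The four feedback conditions compared in the study (labels hypothetical).
CONDITIONS = ["error_highlight", "llm_explanation", "backtranslation", "qa_table"]

TRACKER_DIR = Path("tracker")

def create_trackers():
    """Create one counter file per condition, all starting at zero."""
    TRACKER_DIR.mkdir(exist_ok=True)
    for cond in CONDITIONS:
        (TRACKER_DIR / f"{cond}.json").write_text(json.dumps({"assigned": 0}))

def assign_condition():
    """Randomly assign a new participant among the least-filled conditions,
    keeping the four groups balanced."""
    counts = {
        cond: json.loads((TRACKER_DIR / f"{cond}.json").read_text())["assigned"]
        for cond in CONDITIONS
    }
    least = min(counts.values())
    cond = random.choice([c for c in CONDITIONS if counts[c] == least])
    (TRACKER_DIR / f"{cond}.json").write_text(json.dumps({"assigned": counts[cond] + 1}))
    return cond
```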
We measure three dependent variables in our paper: (1) Decision accuracy, (2) CWA (Confidence-Weighted Accuracy), and (3) Switch percentage.
- Decision accuracy is measured by comparing each participant's shareability judgment against the gold label, averaged across all examples.
- CWA is measured by combining the decision accuracy and confidence scores using confidence weighting to evaluate whether participants made the correct decision weighted by their confidence in that decision. This metric serves as a measure of (in)appropriate reliance, where higher scores indicate accurate decisions made with well-calibrated confidence.
- Switch percentage is a widely used behavioral measure of reliance, capturing how often participants change their decisions after viewing AI feedback. In our context, it reflects how quality feedback influences final shareability judgments. We compute three metrics:
- Over-reliance: the proportion of cases where a participant changes from a correct to an incorrect decision after feedback
- Under-reliance: the proportion of cases where a participant does not change from an incorrect decision to a correct one after the quality feedback
- Appropriate reliance: the proportion of cases where a participant either corrects an incorrect decision after receiving feedback (switch) or maintains a correct decision (no switch)
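The dependent variables above can be sketched as follows, assuming binary decisions and confidence scores normalized to [0, 1]. The CWA weighting shown is one plausible formulation, not necessarily the exact one used in the paper:

```python
def decision_accuracy(decisions, gold):
    """Fraction of shareability judgments matching the gold label."""
    return sum(d == g for d, g in zip(decisions, gold)) / len(gold)

def cwa(decisions, gold, confidence):
    """Confidence-Weighted Accuracy: credit correct decisions in proportion
    to the participant's confidence, and incorrect ones in proportion to
    their doubt. One plausible formulation; see the paper for the exact one."""
    scores = [c if d == g else 1 - c for d, g, c in zip(decisions, gold, confidence)]
    return sum(scores) / len(scores)

def reliance_breakdown(initial, final, gold):
    """Switch-based reliance measures over (initial, final) decision pairs."""
    n = len(gold)
    # Over-reliance: switched from a correct to an incorrect decision.
    over = sum(i == g and f != g for i, f, g in zip(initial, final, gold)) / n
    # Under-reliance: kept an incorrect decision despite the feedback.
    under = sum(i != g and f != g for i, f, g in zip(initial, final, gold)) / n
    # Appropriate reliance: corrected an incorrect decision or kept a correct one,
    # i.e. the final decision is correct.
    appropriate = sum(f == g for f, g in zip(final, gold)) / n
    return {"over_reliance": over, "under_reliance": under, "appropriate": appropriate}
```

Note that the three reliance fractions partition all cases, so they sum to 1.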
From the collected responses, we evaluate the following:
- `evaluation/make_summary.py`: Build a summary CSV (`summary.csv`) from the raw responses; it is used for the further evaluation below.
- `evaluation/dv_summary.py`: Calculate each dependent variable (decision accuracy and CWA).
- `evaluation/bonus_tracker.py`: Track which participants will receive the performance-based bonus (over 70% overall accuracy).
- `evaluation/free_comments.py`: Analyze participants' free-form responses.
- `evaluation/post_survey_analysis.py`: Analyze participants' post-task survey questions on perceived helpfulness, trust in future use, and mental burden.
- `evaluation/switch_percentage.py`: Calculate the breakdown of switch percentages.
We further test statistical significance for each dependent variable:
- `evaluation/significance_test/between_ind.py`: Significance tests comparing independent decision-making performance across conditions.
- `evaluation/significance_test/between_ai.py`: Significance tests comparing AI-assisted decision-making performance across conditions.
- `evaluation/significance_test/within.py`: Significance tests within each condition (independent vs. AI-assisted).
- `evaluation/significance_test/per_label.py`: Significance tests per shareability label (Safe to share as-is vs. Needs bilingual review before sharing).
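As one example of a within-condition comparison, a paired permutation test can compare each participant's independent vs. AI-assisted accuracy. This is a generic sketch with toy numbers, not the paper's exact test:

```python
import random

def paired_permutation_test(before, after, n_resamples=10_000, seed=0):
    """Two-sided paired permutation test: randomly flip the sign of each
    paired difference and count how often the resampled mean difference is
    at least as extreme as the observed one."""
    rng = random.Random(seed)
    diffs = [a - b for a, b in zip(after, before)]
    observed = sum(diffs) / len(diffs)
    extreme = 0
    for _ in range(n_resamples):
        resampled = [d if rng.random() < 0.5 else -d for d in diffs]
        if abs(sum(resampled) / len(resampled)) >= abs(observed):
            extreme += 1
    return extreme / n_resamples

# Per-participant accuracy, independent vs. AI-assisted (toy numbers).
ind = [0.55, 0.60, 0.50, 0.65, 0.58]
ai = [0.70, 0.72, 0.66, 0.75, 0.69]
p = paired_permutation_test(ind, ai)
```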
We release our code used for creating visualizations in the paper:
(1) visualization/main_evaluation.py
(2) visualization/per_shareability.py
(3) visualization/switch_percentage.py
If you find our work useful in your research, please consider citing:
@inproceedings{ki-etal-2025-share,
title = "Should {I} Share this Translation? Evaluating Quality Feedback for User Reliance on Machine Translation",
author = "Ki, Dayeon and
Duh, Kevin and
Carpuat, Marine",
editor = "Christodoulopoulos, Christos and
Chakraborty, Tanmoy and
Rose, Carolyn and
Peng, Violet",
booktitle = "Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing",
month = nov,
year = "2025",
address = "Suzhou, China",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2025.emnlp-main.606/",
doi = "10.18653/v1/2025.emnlp-main.606",
pages = "12069--12092",
ISBN = "979-8-89176-332-6",
}
For questions, issues, or collaborations, please reach out to dayeonki@umd.edu.