Automatic Input Rewriting Improves
Translation with Large Language Models


Dayeon Ki, Marine Carpuat
University of Maryland

This repository contains the code and dataset for our NAACL 2025 main conference paper
Automatic Input Rewriting Improves Translation with Large Language Models.

arXiv
HuggingFace


👾 TL;DR

Can we improve machine translation with LLMs by automatically rewriting their inputs? We present an empirical study of 21 input rewriting methods for translating from English into 6 target languages, and find that text simplification is the most effective MT-agnostic rewriting strategy.

📰 News

  • 2025-01-22 Our paper is accepted to NAACL 2025! See you in New Mexico!

🗺️ Overview

We ask the following questions:

(1) Can we improve MT quality from LLMs by rewriting inputs for style?
(2) How can we guide models to rewrite inputs to improve their translatability?

We conduct an empirical study of 21 input rewriting methods with varying levels of MT-awareness.
We first rewrite the source sentence with each rewriting method, translate each rewrite into the target language, and then evaluate the rewrites in terms of (i) translatability and (ii) meaning preservation.

Results


🚀 Quick Start

Data Preparation

We use the WMT-23 General MT task data from the Tower-Eval dataset. For the main experiments, we focus on three language pairs: English-German (en-de), English-Russian (en-ru), and English-Chinese (en-zh). The raw data is in: data/raw_{language_pair}.jsonl.

We translate each English source sentence into the respective target language using the TowerInstruct LLM. The translated dataset is in: data/{language_pair}_mt_tower.jsonl.
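The .jsonl files above hold one JSON record per line, so they can be loaded with a few lines of standard-library Python. This is a minimal sketch (the exact field names inside each record are not specified here, so the example keeps records as plain dicts):

```python
import json

def load_jsonl(path):
    """Read one JSON object per line from a .jsonl file, skipping blank lines."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]

# e.g., records = load_jsonl("data/raw_en-de.jsonl")
```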

MT-Agnostic Rewrite

MT-agnostic rewriting methods rely on prior assumptions about what makes text easier to translate; they take no translatability signal or knowledge of the end task as input. We consider three prompting variants, all inspired by prior work on source rewriting:

  1. Simplification: replacing complex words with simpler ones, rephrasing complex syntactic structures, or shortening sentences.

    • code: mt_agnostic/simple_{llm}.py
  2. Paraphrasing: paraphrasing with LLMs might benefit MT by normalizing inputs toward language patterns that are more frequent in LLM training data.

    • code: mt_agnostic/paraphrase_{llm}.py
  3. Stylistic transfer: use the off-the-shelf text editing tool CoEdit to rewrite inputs according to diverse style specifications: fixing the grammar, making the text more coherent, making it easier to understand, and rewriting it more formally.

    • code: mt_agnostic/dipper_paraphrase.py
    • code: mt_agnostic/coedit_{style}.py where style can be coherent, formal, gec, paraphrase, understand

Each script accepts the following arguments:

  • --model_name_hf: The name or path of a transformers-based pre-trained checkpoint. You can refer directly to the Hugging Face model. This argument is required only for 1 (Simplification) and 2 (Paraphrasing).
  • --input_path: Path to input data file
  • --output_path: Save path of output file (after rewriting)
  • --cache_dir: Cache directory of pre-trained model checkpoints
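To make the two LLM-based variants concrete, here is a minimal sketch of how such rewriting prompts could be assembled. The wording below is illustrative only; the actual prompts live in the mt_agnostic/*.py scripts:

```python
def build_agnostic_prompt(source: str, method: str) -> str:
    """Build an MT-agnostic rewriting prompt.

    The template wording here is a hypothetical example, not the exact
    prompt used by mt_agnostic/simple_{llm}.py or paraphrase_{llm}.py.
    """
    templates = {
        "simplify": (
            "Rewrite the following sentence so it is easier to read: replace "
            "complex words with simpler ones, rephrase complex syntactic "
            "structures, and shorten the sentence where possible.\n"
        ),
        "paraphrase": (
            "Paraphrase the following sentence while preserving its meaning.\n"
        ),
    }
    return templates[method] + f"Sentence: {source}\nRewrite:"

prompt = build_agnostic_prompt("The ramifications were manifold.", "simplify")
```

The resulting string would then be fed to the chosen LLM (the --model_name_hf checkpoint) for generation.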

Task-Aware Rewrite

We design prompts that account for the fact that the rewrites are aimed at MT. Prior work has shown that LLMs can post-edit errors in MT outputs, and we ask whether this ability extends to rewriting inputs to enhance translatability. We consider two prompting strategies:

  1. Easy translation: prompt LLMs to rewrite inputs in a way that specifically facilitates translation into the target language.

    • code: task_aware/easy_{llm}.py
  2. Chain of thought (CoT): prompt LLMs to handle the entire rewriting and translation process in one sequence.

    • code: task_aware/cot_{llm}.py
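The difference between the two strategies is easiest to see side by side. Again, the wording is a hypothetical sketch; the exact prompts are in task_aware/easy_{llm}.py and task_aware/cot_{llm}.py:

```python
def build_task_aware_prompt(source: str, tgt_lang: str, strategy: str) -> str:
    """Illustrative prompts for the two task-aware strategies (assumed
    wording, not the repository's exact prompt text)."""
    if strategy == "easy":
        # Easy translation: rewrite only, with the target language in mind.
        return (
            f"Rewrite the following English sentence so that it is easier "
            f"to translate into {tgt_lang}, keeping its meaning.\n"
            f"Sentence: {source}\nRewrite:"
        )
    if strategy == "cot":
        # CoT: rewrite and translate in a single generated sequence.
        return (
            f"First rewrite the following English sentence to make it easier "
            f"to translate, then translate your rewrite into {tgt_lang}. "
            f"Think step by step.\n"
            f"Sentence: {source}\nRewrite and translation:"
        )
    raise ValueError(f"unknown strategy: {strategy}")
```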

Translation

We translate each generated rewrite into the respective target language using translate/translate_tower.py.

python -u translate/translate_tower.py \
  --model_name_hf Unbabel/TowerInstruct-7B-v0.2 \
  --input_path $PATH_TO_INPUT_FILE \
  --output_path $PATH_TO_OUTPUT_FILE \
  --tgt_lang $TARGET_LANGUAGE \
  --model_type $MODEL_TYPE \
  --cache_dir $PATH_TO_CACHE_DIR

Arguments for the translate code are as follows:

  • --model_name_hf: The name or path of a transformers-based pre-trained checkpoint. You can directly refer to the Huggingface model.
  • --input_path: Path to input data file.
  • --output_path: Save path of output file (after translation).
  • --tgt_lang: Target language (either German, Russian, or Chinese).
  • --model_type: Type of rewrite method (current code is set to simplification rewrite).
  • --cache_dir: Cache directory of pre-trained model checkpoints.
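TowerInstruct is a chat-tuned model, so translate/translate_tower.py presumably wraps each rewrite in a chat-style translation request. A minimal sketch of what that request could look like (the message wording follows the general TowerInstruct model-card pattern and is an assumption here; the script itself is authoritative):

```python
def tower_translation_messages(source: str, tgt_lang: str):
    """Build a chat-style message list asking TowerInstruct to translate
    an English sentence into tgt_lang. Wording is illustrative."""
    user = (
        f"Translate the following text from English into {tgt_lang}.\n"
        f"English: {source}\n{tgt_lang}:"
    )
    return [{"role": "user", "content": user}]

messages = tower_translation_messages("Hello there.", "German")
```

With transformers, such a message list would typically be rendered into model input via tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True) before generation.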

Evaluation

We use xCOMET and MetricX to evaluate different aspects of rewrite quality. With these two models, we compute three evaluation metrics:

  • Translatability: quality estimation (QE) score between the source and target
  • Meaning preservation: QE score between the target and reference translation
  • Overall translation quality: Reference-based score using source, target, and reference translation

We show an example of evaluating the CoEdit style transfer rewrites. The other rewrite methods can be evaluated with the same code: evaluate/xcomet_mt_coedit.py (for translatability), evaluate/xcomet_ref_coedit.py (for meaning preservation), and evaluate/xcomet_mtref_coedit.py (for overall translation quality).

python -u evaluate/xcomet_mt_coedit.py \
  --input_path $PATH_TO_INPUT_FILE \
  --output_path $PATH_TO_OUTPUT_FILE \
  --cache_dir $PATH_TO_CACHE_DIR

Arguments for the evaluation code are as follows:

  • --input_path: Path to input data file.
  • --output_path: Save path of output file (after evaluation).
  • --cache_dir: Cache directory of pre-trained model checkpoints.

🤲 Citation

If you find our work useful in your research, please consider citing:

@inproceedings{ki-carpuat-2025-automatic,
    title = "Automatic Input Rewriting Improves Translation with Large Language Models",
    author = "Ki, Dayeon  and
      Carpuat, Marine",
    editor = "Chiruzzo, Luis  and
      Ritter, Alan  and
      Wang, Lu",
    booktitle = "Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)",
    month = apr,
    year = "2025",
    address = "Albuquerque, New Mexico",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2025.naacl-long.542/",
    doi = "10.18653/v1/2025.naacl-long.542",
    pages = "10829--10856",
    ISBN = "979-8-89176-189-6",
}

📧 Contact

For questions, issues, or collaborations, please reach out to dayeonki@umd.edu.
