This repository contains our WikiExpl dataset, a semi-automatic collection of naturally occurring explicitations in a Wikipedia bitext corpus, annotated by human translators, from our EMNLP 2023 main conference paper (arXiv).
The JSON files contain the candidates extracted by our detection algorithm.
Each candidate is annotated by three annotators, and we assign the label based on the majority vote.
We consider a candidate a final explicitation if two or more annotators agree.
The list of final explicitations is in `expl_idx_list`. We merge the annotated spans of explicitation from different annotators by maximizing span coverage.
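As a rough sketch of how this annotation logic can be consumed (the field names and filename below are hypothetical, not the repository's actual schema):

```python
# Hypothetical sketch: the field names ("annotations", "label", "start", "end")
# and the input filename are assumptions, not the repository's actual schema.
import json
from collections import Counter

with open("wikiexpl_candidates.json") as f:   # hypothetical filename
    candidates = json.load(f)

final_explicitations = []
for cand in candidates:
    labels = [a["label"] for a in cand["annotations"]]            # three annotators
    majority_label, votes = Counter(labels).most_common(1)[0]
    if majority_label == "explicitation" and votes >= 2:          # two or more agree
        # Merge spans by maximizing coverage: take the union of the annotated spans.
        starts = [a["start"] for a in cand["annotations"] if a["label"] == "explicitation"]
        ends = [a["end"] for a in cand["annotations"] if a["label"] == "explicitation"]
        final_explicitations.append({**cand, "span": (min(starts), max(ends))})
```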
We provide simple tools for easy exploration:
$ python show.py
Example output:
Here, the red span within the source text (green) is the part that undergoes explicitation in the corresponding target translation, and the red span within the target text (blue) is its explicitation.
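For intuition, a minimal sketch of how such highlighting can be rendered in a terminal with ANSI colors (show.py's actual rendering may differ, and the example sentences below are made up):

```python
# Minimal sketch with ANSI escape codes; show.py's actual rendering may differ.
GREEN, BLUE, RED, RESET = "\033[32m", "\033[34m", "\033[31m", "\033[0m"

def highlight(text, start, end, base_color):
    """Return text in base_color, with the span text[start:end] in red."""
    return (base_color + text[:start] + RED + text[start:end]
            + base_color + text[end:] + RESET)

# Made-up example: the source span "Berlin" triggers an explicitation
# ("the capital of Germany") in the target.
print(highlight("Berlin ist schön.", 0, 6, GREEN))                        # source
print(highlight("Berlin, the capital of Germany, is beautiful.", 8, 30, BLUE))  # target
```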
Note: The most recent code in this repo was written in June 2023.
$ conda create -n autoexpl python=3.9.12
# module load cuda/11.3.1 cudnn/v8.2.1
$ conda activate autoexpl
$ pip3 install torch torchvision torchaudio --extra-index-url https://download.pytorch.org/whl/cu113
$ pip install -r requirements.txt
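To quickly verify that the CUDA-enabled PyTorch build installed correctly, you can run:
$ python -c "import torch; print(torch.__version__, torch.cuda.is_available())"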
This section describes how we collect and examine naturally occurring explicitations in bitexts commonly used as MT training data. The resulting WikiExpl corpus lets us reason about when explicitation is necessary and lets our automatic explicitation method learn how to generate it.
The detection code and data will be provided in the near future. The decision code and its application to the XQB dataset are available and described below.
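While the detection code is not yet included, one simplified way to surface such candidates (a toy sketch, not the repository's algorithm) is to look for target-side token spans that no source token aligns to:

```python
# Toy illustration only (NOT the repository's detection algorithm): find maximal
# runs of target tokens that no source token aligns to, given word alignments.
def unaligned_target_spans(alignment, num_tgt_tokens):
    """alignment: set of (src_idx, tgt_idx) pairs; returns (start, end) spans
    of target indices that are not aligned to any source token."""
    aligned_tgt = {t for _, t in alignment}
    spans, start = [], None
    for t in range(num_tgt_tokens):
        if t not in aligned_tgt and start is None:
            start = t
        elif t in aligned_tgt and start is not None:
            spans.append((start, t))
            start = None
    if start is not None:
        spans.append((start, num_tgt_tokens))
    return spans

# e.g. alignment {(0,0), (1,1), (2,6)} over 7 target tokens -> [(2, 6)]
print(unaligned_target_spans({(0, 0), (1, 1), (2, 6)}, 7))
```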
This section builds on WikiExpl to explore generating explicitations automatically.
We use the XQB dataset, which contains parallel question pairs in English and various non-English languages.
For more details on the XQB dataset and the Quizbowl task, please refer to the respective links.
The original XQB dataset is from https://github.com/h-j-han/simqa and original quizbowl evaluation code is from https://github.com/Pinafore/qb.
We use the same decision algorithm to determine whether explicitation is needed, and apply it to generate automatic explicitations on the XQB-pl/es (Polish/Spanish-to-English Quizbowl Dataset).
First, detect named entities (NEs) in the sentences. (The default parameter settings are for XQB-es. Use additional parameter options for XQB-pl.)
python autoexpl/save_ner.py
This script outputs files to: xqb_eval/exp_gen/nerdict.*.pkl.
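The internals of save_ner.py are not reproduced here; as a rough sketch of this step, assuming a spaCy pipeline (the repository may use a different NER model and output format):

```python
# Rough sketch only, assuming spaCy; save_ner.py may use a different NER model,
# and the exact output filename suffix below is an assumption.
import pickle
import spacy

nlp = spacy.load("es_core_news_sm")  # Spanish model for XQB-es (assumption)

ner_dict = {}
with open("xqb_eval/exp_gen/esqbv1.sent.es") as f:
    for line in f:
        sent = line.strip()
        doc = nlp(sent)
        ner_dict[sent] = [(ent.text, ent.label_, ent.start_char, ent.end_char)
                          for ent in doc.ents]

with open("xqb_eval/exp_gen/nerdict.es.pkl", "wb") as f:
    pickle.dump(ner_dict, f)
```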
Next, based on the NER information, determine whether explicitation should be performed and extract the corresponding Wikidata ID for the selected named entity.
Note: This step may take some time and require a large amount of memory.
(The default parameter settings and the script examples below are for XQB-es. Use additional parameter options for XQB-pl.)
Download the file lang_title2wikidataID-normalized_with_redirect.pkl from https://github.com/facebookresearch/GENRE and place it in models/genre/.
Then, run:
python autoexpl/save_wikidataid.py --lang es --sent-file xqb_eval/exp_gen/esqbv1.sent.es
python autoexpl/save_wikidataid.py --lang en --sent-file xqb_eval/exp_gen/esqbv1.sent.en
This script outputs files to: xqb_eval/exp_gen/wikiid.*.pkl.
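For intuition, a minimal sketch of the lookup behind this step, assuming (as in GENRE's examples) that the pickle maps (language, title) pairs to sets of Wikidata IDs; save_wikidataid.py's actual logic is more involved:

```python
# Minimal sketch, assuming the GENRE pickle maps (language, title) pairs to sets
# of Wikidata IDs; the real script also performs the explicitation decision.
import pickle

with open("models/genre/lang_title2wikidataID-normalized_with_redirect.pkl", "rb") as f:
    lang_title2id = pickle.load(f)

# Look up the Wikidata IDs for an entity mention resolved to a Wikipedia title.
print(lang_title2id.get(("es", "Madrid")))   # e.g. a set like {'Q2807'} (illustrative)
print(lang_title2id.get(("en", "Madrid")))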
Finally, generate the explicitation for the selected named entities by retrieving information based on their Wikidata IDs.
Note: The generated explicitations may not exactly replicate those in the original 2023 dataset, as entity descriptions on Wikidata may have changed over time.
(The default parameter settings are for XQB-es. Use additional parameter options for XQB-pl.)
python autoexpl/xqb/gen_exp_xqb_pair_charskip.py
This script outputs files to: xqb_eval/extrinsic/esqbv1htall.pair_coment_charentskip_dedup_gent4.*.json and xqb_eval/esqbv1htall.pair_coment_charentskip_dedup_gent4.esen.exp.json
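As a rough sketch of the retrieval behind this step (the script may instead use a local dump or different entity fields), an entity's English description can be fetched from Wikidata by its ID:

```python
# Rough sketch only: fetch an entity description from the live Wikidata API.
# As noted above, live descriptions may differ from those in the 2023 dataset.
import requests

def wikidata_description(qid, lang="en"):
    url = f"https://www.wikidata.org/wiki/Special:EntityData/{qid}.json"
    entity = requests.get(url, timeout=30).json()["entities"][qid]
    return entity["descriptions"].get(lang, {}).get("value")

print(wikidata_description("Q2807"))  # Madrid's current English description
```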
xqb_eval/plqbv1ht512.pair_coment_charentskip_dedup_gent4.xxxx.exp.json: contains the full details of each explicitation: which entity in which part of the question is explicitated, what type of explicitation is applied, and what explanation text is added.
`orig_qid` corresponds to the question id in XQB. `ent_id` is the id of an entity detected in the question text, formed by appending two digits to `orig_qid`. `exp_id` identifies a version of the generated explicitation, formed by appending two digits to `ent_id`.
xqb_eval/extrinsic/plqbv1ht512.pair_coment_charentskip_dedup_gent4.xx.json: the realization of each explicitation within the question, based on *.xxxx.exp.json. This file is consumed by the guesser.
`qanta_id` is the `exp_id` described above.
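To illustrate the ID scheme with hypothetical values:

```python
# Hypothetical values, shown only to illustrate the ID scheme described above.
orig_qid = "101"            # question id in XQB
ent_id = orig_qid + "02"    # an entity detected in that question (two digits appended)
exp_id = ent_id + "01"      # one version of its generated explicitation (two digits appended)
print(orig_qid, ent_id, exp_id)   # -> 101 10102 1010201
```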
Try `python autoexpl/XQB/display_explicitation.py` for a highlighted text display of XQB explicitations.
We use the LLaMA model from https://github.com/meta-llama/llama/tree/57b0eb62de0636e75af471e49e2f1862d908d9d8 to generate guesses. You can run `git submodule update --init --recursive` to fetch the pinned version of that repo.
We provide our results in xqb_eval/extrinsic/rawresult*.
If you want to replicate our results from scratch, you need to download the model.
$ ./gen_guess_llama.sh # set start index and end index of the questions if it takes too long for the entire set
You can merge pieces with autoexpl/XQB/merge_split_guess.py.
Input is rawresult* and output is LLaMA*
python autoexpl/XQB/parse_raw_guesses_llama_step.py --lang pl --dataset-name plqbv1ht512 --ckpt-dir /13B
python autoexpl/XQB/parse_raw_guesses_llama_step.py --lang en --dataset-name plqbv1ht512 --ckpt-dir /13B
python autoexpl/XQB/parse_raw_guesses_llama_step.py --lang es --dataset-name esqbv1htall --ckpt-dir /7B
python autoexpl/XQB/parse_raw_guesses_llama_step.py --lang en --dataset-name esqbv1htall --ckpt-dir /7B
Input is LLaMA* and output is origvsexp*
python autoexpl/XQB/gather_guessbuzz_llama.py
Finally, we provide plot.ipynb to reproduce the plots in our paper.
Three human evaluation results are in xqb_eval/intrinsic.
All entities were shown to annotators to evaluate whether the explicitation decision for each entity was helpful or not.
Each entity has three versions of explicitation, with different generation types and integrations. Only one explicitation generation per entity was shown to each annotator.
@inproceedings{han-etal-2023-auto-explicitation,
title = "Bridging Background Knowledge Gaps in Translation with Automatic Explicitation",
author = "Han, HyoJung and Boyd-Graber, Jordan and Carpuat, Marine",
booktitle = "Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing",
month = dec,
year = "2023",
address = "Singapore, Singapore",
publisher = "Association for Computational Linguistics",
url = "https://openreview.net/pdf?id=PBvSGqYCSa",
}
If you also use the intrinsic evaluation results or follow the extrinsic evaluation with XQB, please cite:
@inproceedings{han-etal-2022-simqa,
title = "{S}im{QA}: Detecting Simultaneous {MT} Errors through Word-by-Word Question Answering",
author = "Han, HyoJung and
Carpuat, Marine and
Boyd-Graber, Jordan",
booktitle = "Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing",
month = dec,
year = "2022",
address = "Abu Dhabi, United Arab Emirates",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2022.emnlp-main.378",
pages = "5598--5616",
abstract = "Detractors of neural machine translation admit that while its translations are fluent, it sometimes gets key facts wrong. This is particularly important in simultaneous interpretation where translations have to be provided as fast as possible: before a sentence is complete. Yet, evaluations of simultaneous machine translation (SimulMT) fail to capture if systems correctly translate the most salient elements of a question: people, places, and dates. To address this problem, we introduce a downstream word-by-word question answering evaluation task (SimQA): given a source language question, translate the question word by word into the target language, and answer as soon as possible. SimQA jointly measures whether the SimulMT models translate the question quickly and accurately, and can reveal shortcomings in existing neural systems{---}hallucinating or omitting facts.",
}