
Bridging Background Knowledge Gaps in Translation with Automatic Explicitation

This repository contains our WikiExpl dataset, a semi-automatic collection of naturally occurring explicitations in a Wikipedia bitext corpus, annotated by human translators, from our EMNLP 2023 main conference paper (arXiv).

The JSON files contain the candidates extracted by our detection algorithm. Each candidate is annotated by three annotators, and we assign its label by majority vote: a candidate is considered a final explicitation if two or more annotators agree. The list of final explicitations is in expl_idx_list. We merge the explicitation spans annotated by different annotators by maximizing span coverage.
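For illustration only, the labeling and span-merging scheme described above could be sketched as follows; the function names and data layout are hypothetical, and this is not the code used to build WikiExpl.

# Hypothetical sketch of the annotation aggregation described above.
def is_final_explicitation(votes):
    # A candidate is a final explicitation if two or more of the three annotators agree.
    return sum(votes) >= 2

def merge_spans(spans):
    # Merge annotated (start, end) spans by maximizing coverage,
    # i.e. take the widest extent marked by any annotator.
    starts, ends = zip(*spans)
    return min(starts), max(ends)

votes = [True, True, False]    # two of three annotators marked the candidate
spans = [(10, 25), (12, 30)]   # spans from the annotators who marked it
if is_final_explicitation(votes):
    print(merge_spans(spans))  # -> (10, 30)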

We provide simple tools for easy exploration:

$ python show.py

Example output: the red span in the source text (green) is the part that undergoes explicitation in the corresponding target translation, and the red span in the target text (blue) is its explicitation.

Note: The most recent code in this repo was written in June 2023.

Environment Setup

$ conda create -n autoexpl python=3.9.12
# module load cuda/11.3.1 cudnn/v8.2.1
$ conda activate autoexpl
$ pip3 install torch torchvision torchaudio --extra-index-url https://download.pytorch.org/whl/cu113
$ pip install -r requirements.txt

Building the WikiExpl Dataset of Explicitations

This section describes how we collect and examine naturally occurring explicitations in bitexts commonly used as MT training data. The resulting WikiExpl corpus lets us reason about when explicitation is necessary and lets our automatic explicitation method learn how to generate it.

The detection code and data will be provided in the near future. The decision code and its application to the XQB dataset are available and described below.

Evaluating Automatic Explicitation

This section builds on WikiExpl to explore generating explicitations automatically.
We use the XQB dataset, which contains parallel question pairs in English and various non-English languages.
For more details on the XQB dataset and the Quizbowl task, please refer to the links below.
The original XQB dataset is from https://github.com/h-j-han/simqa and the original Quizbowl evaluation code is from https://github.com/Pinafore/qb.

Generate Automatic Explicitation in XQB-pl/es for Evaluation

We use the same decision algorithm to determine whether explicitation is needed, and apply it to generate automatic explicitations on XQB-pl/es (the Polish/Spanish-to-English Quizbowl datasets).

Get NER

First, run named entity recognition (NER) on the sentences to detect named entities. (The default parameter settings are for XQB-es; use additional parameter options for XQB-pl.)

python autoexpl/save_ner.py

This script outputs files to: xqb_eval/exp_gen/nerdict.*.pkl.
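To sanity-check this step, you can load the pickle and print a few entries. The filename and the assumed dict layout below are hypothetical, since the exact structure of nerdict.*.pkl is not documented here.

# Hedged sketch: inspect the NER output pickle.
# Assumes a dict of sentence key -> detected entities; save_ner.py may write a different structure.
import pickle

with open("xqb_eval/exp_gen/nerdict.esqbv1.es.pkl", "rb") as f:  # hypothetical filename
    nerdict = pickle.load(f)

for key, entities in list(nerdict.items())[:5]:
    print(key, entities)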

Decide and Get Wikidata IDs

Next, based on the NER information, decide whether explicitation should be performed and extract the corresponding Wikidata ID for each selected named entity. Note: This step may take some time and requires a large amount of memory.
(The default parameter settings and the script examples below are for XQB-es. Use additional parameter options for XQB-pl.)

Download the file lang_title2wikidataID-normalized_with_redirect.pkl from https://github.com/facebookresearch/GENRE and place it in the models/genre directory.

Then, run:

python autoexpl/save_wikidataid.py --lang es --sent-file xqb_eval/exp_gen/esqbv1.sent.es
python autoexpl/save_wikidataid.py --lang en --sent-file xqb_eval/exp_gen/esqbv1.sent.en

This script outputs files to: xqb_eval/exp_gen/wikiid.*.pkl.
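For context, the GENRE pickle maps Wikipedia titles to Wikidata IDs. A lookup might look roughly like the sketch below; the (lang, title) key format and set-valued entries are assumptions about that file, not something documented in this repo.

# Hedged sketch: look up the Wikidata ID(s) for a Wikipedia title via the GENRE mapping.
import pickle

with open("models/genre/lang_title2wikidataID-normalized_with_redirect.pkl", "rb") as f:
    lang_title2wikidataID = pickle.load(f)

# Assumed key format: (language code, page title) -> set of Wikidata IDs.
print(lang_title2wikidataID.get(("en", "Barack Obama")))  # e.g. a set like {'Q76'}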

Generate Automatic Explicitation

Finally, generate the explicitation for the selected named entities by retrieving information based on their Wikidata IDs. Note: The generated explicitations may not exactly replicate those in the original 2023 dataset, as entity descriptions on Wikidata may have changed over time.
(The default parameter settings are for XQB-es. Use additional parameter options for XQB-pl.)

python autoexpl/xqb/gen_exp_xqb_pair_charskip.py

This script outputs files to: xqb_eval/extrinsic/esqbv1htall.pair_coment_charentskip_dedup_gent4.*.json and xqb_eval/esqbv1htall.pair_coment_charentskip_dedup_gent4.esen.exp.json
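As a standalone illustration of how entity information can be retrieved by Wikidata ID (the repo's script may obtain it differently), the public wbgetentities API returns a short entity description that could serve as explicitation content:

# Hedged illustration: fetch an entity's short description from the public Wikidata API.
# gen_exp_xqb_pair_charskip.py may retrieve this information in another way.
import requests

def wikidata_description(qid, lang="en"):
    resp = requests.get(
        "https://www.wikidata.org/w/api.php",
        params={
            "action": "wbgetentities",
            "ids": qid,
            "props": "descriptions",
            "languages": lang,
            "format": "json",
        },
    )
    entity = resp.json()["entities"][qid]
    return entity.get("descriptions", {}).get(lang, {}).get("value")

print(wikidata_description("Q76"))  # e.g. a short description of Barack Obama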

Evaluate the generated Auto Explicitation on XQB

Input Data

xqb_eval/plqbv1ht512.pair_coment_charentskip_dedup_gent4.xxxx.exp.json: contains the full details of each explicitation, including which entity in which part of the question is explicitated, what type of explicitation is applied, and what explanation is added.

  • orig_qid corresponds to the question ID in XQB.
  • ent_id is the ID of an entity detected in the question text, formed by appending two digits to orig_qid.
  • exp_id identifies one version of the generated explicitation, formed by appending two digits to ent_id (see the small example below).

xqb_eval/extrinsic/plqbv1ht512.pair_coment_charentskip_dedup_gent4.xx.json: the realization of each explicitation within the question, based on *.xxxx.exp.json. This file is consumed by the guesser.

  • qanta_id corresponds to exp_id above.
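A small, hypothetical example of the hierarchical ID scheme (assuming purely numeric IDs with two digits appended at each level):

# Hypothetical example: exp_id appends two digits to ent_id, which appends two digits to orig_qid.
# The values below are made up.
exp_id = 123450102
ent_id, exp_version = divmod(exp_id, 100)   # ent_id = 1234501, exp_version = 2
orig_qid, ent_index = divmod(ent_id, 100)   # orig_qid = 12345, ent_index = 1
print(orig_qid, ent_index, exp_version)     # -> 12345 1 2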

Try python autoexpl/XQB/display_explicitation.py for a highlighted text display of XQB explicitations.

Generate guesses

We use the LLaMA model from https://github.com/meta-llama/llama/tree/57b0eb62de0636e75af471e49e2f1862d908d9d8 to generate guesses. You can run git submodule update --init --recursive to fetch the pinned repository.
We provide our results in xqb_eval/extrinsic/rawresult*.
If you want to replicate our results from scratch, you need to download the model weights.

$ ./gen_guess_llama.sh # set start index and end index of the questions if it takes too long for the entire set

You can merge the pieces with autoexpl/XQB/merge_split_guess.py.

Parse raw text output

The input is rawresult* and the output is LLaMA*.

python autoexpl/XQB/parse_raw_guesses_llama_step.py --lang pl  --dataset-name plqbv1ht512 --ckpt-dir /13B
python autoexpl/XQB/parse_raw_guesses_llama_step.py --lang en  --dataset-name plqbv1ht512 --ckpt-dir /13B
python autoexpl/XQB/parse_raw_guesses_llama_step.py --lang es  --dataset-name esqbv1htall --ckpt-dir /7B
python autoexpl/XQB/parse_raw_guesses_llama_step.py --lang en  --dataset-name esqbv1htall --ckpt-dir /7B

Gather results

The input is LLaMA* and the output is origvsexp*.

python autoexpl/XQB/gather_guessbuzz_llama.py

Finally, we provide plot.ipynb to reproduce the plots in our paper.

Human Evaluation

Three human evaluation results are in xqb_eval/intrinsic. All entities were shown to annotators to evaluate whether the decision to explicitate each entity was helpful or not. Each entity has three versions of explicitation with different generation types and integrations, and only one explicitation generation per entity was shown to each annotator.

Reference

@inproceedings{han-etal-2023-auto-explicitation,
    title = "Bridging Background Knowledge Gaps in Translation with Automatic Explicitation",
    author = "Han, HyoJung  and Boyd-Graber, Jordan  and Carpuat, Marine",
    booktitle = "Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing",
    month = dec,
    year = "2023",
    address = "Singapore, Singapore",
    publisher = "Association for Computational Linguistics",
    url = "https://openreview.net/pdf?id=PBvSGqYCSa",
}

If you also use the intrinsic evaluation results or follow the extrinsic evaluation with XQB, please also cite:

@inproceedings{han-etal-2022-simqa,
    title = "{S}im{QA}: Detecting Simultaneous {MT} Errors through Word-by-Word Question Answering",
    author = "Han, HyoJung  and
      Carpuat, Marine  and
      Boyd-Graber, Jordan",
    booktitle = "Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing",
    month = dec,
    year = "2022",
    address = "Abu Dhabi, United Arab Emirates",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2022.emnlp-main.378",
    pages = "5598--5616",
    abstract = "Detractors of neural machine translation admit that while its translations are fluent, it sometimes gets key facts wrong. This is particularly important in simultaneous interpretation where translations have to be provided as fast as possible: before a sentence is complete. Yet, evaluations of simultaneous machine translation (SimulMT) fail to capture if systems correctly translate the most salient elements of a question: people, places, and dates. To address this problem, we introduce a downstream word-by-word question answering evaluation task (SimQA): given a source language question, translate the question word by word into the target language, and answer as soon as possible. SimQA jointly measures whether the SimulMT models translate the question quickly and accurately, and can reveal shortcomings in existing neural systems{---}hallucinating or omitting facts.",
}
