- Paper: Automatically Identifying Gender Issues in Machine Translation using Perturbations
- Point of Contact: Hila Gonen, Kellie Webster
This dataset is intended as an evaluation benchmark for gender issues in Machine Translation. We consider the challenges in modeling and handling gendered language in the context of machine translation, and extend previous work that identifies issues using synthetic examples. We focus on the class of issues that surface when a neutral reference to a person is translated into a gendered form. For this class of examples, the MT task requires a system to produce a single translation without source cues, thus exposing the model's preferred gender for the reference form. We include English source sentences and four gendered target languages across three language families (French, German, Spanish, and Russian). The examples in the dataset expose where MT encodings are gendered, surfacing new issues not covered by previous manual approaches.
The dataset is automatically curated using the following pipeline:
- We keep only gender-neutral sentences with a single human entity; the rest are filtered out.
- We perturb the human entity in the source sentence to form pairs of original/perturbed sentences.
- We translate all pairs to the target language and surface pairs in which the gender of the human entity differs between the original and the perturbed sentence (see the sketch after this list).
- Human annotators verify the surfaced pairs ("at risk" pairs).
- Random ("not at risk") pairs are added to the dataset.
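The surfacing step of the pipeline can be sketched as follows. This is a minimal sketch, not the released tooling: `translate` and `entity_gender` are hypothetical stand-ins for the external MT system (Google Translate in our case) and for target-language gender detection.

```python
from typing import Callable, Optional

def surface_at_risk_pairs(
    pairs: list[tuple[str, str]],          # (original, perturbed) English sentences
    entity_words: list[tuple[str, str]],   # (original, perturbed) human-entity words
    translate: Callable[[str], str],       # stand-in for the external MT system
    entity_gender: Callable[[str, str], Optional[str]],  # gender ("fem"/"masc") of the
                                                         # entity's form in a translation
) -> list[dict]:
    """Keep pairs whose translated human entity changes grammatical gender."""
    at_risk = []
    for (orig, sub), (orig_word, sub_word) in zip(pairs, entity_words):
        orig_tr, sub_tr = translate(orig), translate(sub)
        g_orig = entity_gender(orig_tr, orig_word)
        g_sub = entity_gender(sub_tr, sub_word)
        # A pair is "at risk" when both genders are resolved and they differ.
        if g_orig and g_sub and g_orig != g_sub:
            at_risk.append({
                "original": (orig, orig_tr, g_orig),
                "substitution": (sub, sub_tr, g_sub),
            })
    return at_risk
```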
Source language: English.
Target languages: French, German, Spanish, and Russian.
We release this dataset under the Apache License, Version 2.0. Refer to LICENSE for details.
@inproceedings{gonen-webster-2020-automatically,
    title = "Automatically Identifying Gender Issues in Machine Translation using Perturbations",
    author = "Gonen, Hila and Webster, Kellie",
    booktitle = "Findings of the Association for Computational Linguistics: EMNLP 2020",
    year = "2020",
}
The dataset consists of pairs of English sentences that differ in a single word (the human entity), together with their translations into a gendered language (French, German, Spanish, or Russian).
Each example consists of two six-line entries. Entries within an example are separated by a single blank line; examples are separated by two blank lines. An entry whose first line is "Original sentences:" contains a sentence unchanged from the underlying data; "Substitution sentences:" indicates that a word has been perturbed. The six lines of an entry are:
"Original sentences:" or "Substitution sentences:"
a sentence, in English (source).
the sentence, in the target language.
a word from the English sentence.
the word, in the target language.
the grammatical gender of the word in the target language, fem or masc.
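The format above can be read with a short parser. This is a minimal sketch under the stated assumptions (six lines per entry, one blank line between entries, two between examples); the field names are our own labels, not part of the file.

```python
def parse_examples(path: str) -> list[list[dict]]:
    """Parse a dataset file into examples, each a list of six-field entries."""
    fields = ["kind", "source_sentence", "target_sentence",
              "source_word", "target_word", "target_gender"]
    with open(path, encoding="utf-8") as f:
        text = f.read().strip()
    examples = []
    # Examples are separated by two blank lines, entries within an example by one.
    for block in text.split("\n\n\n"):
        entries = []
        for entry in block.split("\n\n"):
            lines = entry.strip().split("\n")
            entries.append(dict(zip(fields, lines)))
        examples.append(entries)
    return examples
```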
Statistics of the dataset per target language:
- French: 100 "at risk" pairs, 100 "not at risk" pairs
- German: 100 "at risk" pairs, 100 "not at risk" pairs
- Spanish: 100 "at risk" pairs, 100 "not at risk" pairs
- Russian: 59 "at risk" pairs, 100 "not at risk" pairs
This dataset is automatically curated, with a human verification step that follows the automatic process. The class of issues we are interested in is the one where translation into a gender-marking language exposes a model's gender preference for a personal reference.
By publicly releasing our dataset, we hope to enable the community to work together towards solutions that are inclusive and equitable to all.
Sentences with no human entities or with more than one human entity are filtered out, as are sentences that are not gender-neutral.
All sentences are taken from the "career" subreddit. The perturbations of the human entity are produced using BERT, as sketched below.
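For illustration, the perturbation step can be approximated with an off-the-shelf fill-mask pipeline. This is a sketch under our own assumptions: the model choice (bert-base-cased), the example sentence, and masking only the entity word are ours; the paper's exact candidate selection and filtering are not reproduced here.

```python
from transformers import pipeline

# Fill-mask with BERT as a stand-in for the perturbation step.
unmasker = pipeline("fill-mask", model="bert-base-cased")

sentence = "My nurse asked me to come back next week."  # illustrative example
entity = "nurse"
masked = sentence.replace(entity, unmasker.tokenizer.mask_token, 1)

# Top candidate substitutions for the human entity.
for candidate in unmasker(masked, top_k=5):
    print(candidate["token_str"], round(candidate["score"], 3))
```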
There is no human annotation step, only a human verification step, where speakers of both source and target languages verify that "at risk" sentence pairs exhibit problematic behavior upon translation.
For each target language, the verifier was a speaker of both English and the target language.
The main goal of this dataset is to promote fairness in machine translation by providing an evaluation benchmark for this purpose. We focus specifically on gender bias.
- The scale of the dataset is relatively small.
- The chosen sentences depend on the underlying translation model, in our case Google Translate.