This repository contains code for an English-German neural machine translation (NMT) project. Below is a quick overview of the repo layout:
- `dataset/`: The data set used for this project. It is too large to keep in this repo, but can be freely downloaded from Kaggle: https://www.kaggle.com/datasets/mohamedlotfy50/wmt-2014-english-german. After saving the paired data into the `dataset/` directory, run `python dataset/split_dset` to prepare it for use; see `dataset/split_dset` for details.
- `eval_tables/`: Evaluation tables recording each model's performance across a variety of data subsets under a variety of automatic evaluation metrics. Results are organized by decoding algorithm (greedy or beam search).
- `google_api/`: A script that generates output translations by sending queries to the Google Translate API, along with cached results of running it on various data subsets (see the query sketch after this list).
- `model_pred/`: A module (`cache_predictions.py`) that caches model predictions to this directory for various data subsets (see the caching sketch below).
- `models/`: The definition and implementation of each model class. A module named `all_models.py` imports all of them, so every model can be pulled in with a single import (see the aggregator sketch below).
- `venv_req/`: A `requirements.txt` file specifying the virtual-environment requirements for running this repo.
- `vocab/`: Contains `vocab.py`, which generates the cached sub-word tokenizer vocabularies for each language; the cached tokenizer files are stored in this directory as well (see the tokenizer sketch below).
- `model_eval.py`: Code for evaluating model performance and generating the evaluation summaries found in `eval_tables/` (see the BLEU sketch below).
- `train.py`: Code for training the models.
- `util.py`: General utility functions used throughout the repo.
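For the `google_api/` script, a minimal sketch of querying Google Translate is shown below. It assumes the official `google-cloud-translate` client (v2 API) with credentials already configured; the repo's actual script may use a different client or wrapper, and the example sentence is illustrative.

```python
# Sketch: query Google Translate for an English-to-German translation.
# Assumes the `google-cloud-translate` package and configured credentials;
# the repo's google_api/ script may be implemented differently.
from google.cloud import translate_v2 as translate

client = translate.Client()
result = client.translate("The dog ran through the park.",
                          source_language="en", target_language="de")
print(result["translatedText"])  # the German translation string
```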
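The caching pattern in `model_pred/cache_predictions.py` can be sketched as below. Everything here is a hypothetical placeholder (the `decode` method, the file-naming scheme, the JSON format); see the module itself for the repo's actual logic.

```python
# Sketch: cache decoded translations so evaluation never re-decodes.
# `model.decode`, the path scheme, and the JSON format are hypothetical
# placeholders; see model_pred/cache_predictions.py for the real logic.
import json
from pathlib import Path

def cache_predictions(model, sentences, subset_name, algo="greedy"):
    out_path = Path("model_pred") / f"{subset_name}_{algo}.json"
    if out_path.exists():  # reuse cached predictions when available
        return json.loads(out_path.read_text())
    preds = [model.decode(s, algo=algo) for s in sentences]  # hypothetical API
    out_path.write_text(json.dumps(preds))
    return preds
```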
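A minimal sketch of the `models/all_models.py` aggregator pattern follows. The submodule and class names (`rnn_nmt.RNNSeq2Seq`, `transformer_nmt.TransformerNMT`) are hypothetical placeholders; the real names are whatever is defined in `models/`.

```python
# models/all_models.py -- minimal aggregator sketch.
# NOTE: the submodule and class names below are hypothetical placeholders;
# the actual model classes are defined in models/.
from models.rnn_nmt import RNNSeq2Seq               # hypothetical
from models.transformer_nmt import TransformerNMT   # hypothetical

__all__ = ["RNNSeq2Seq", "TransformerNMT"]
```

With an aggregator like this, `from models.all_models import *` exposes every model class in one line.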
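The sketch below shows one way to generate and cache a sub-word vocabulary as `vocab/vocab.py` does. It assumes the Hugging Face `tokenizers` package as a stand-in; the repo's actual tokenizer library, file paths, vocab size, and special tokens may all differ.

```python
# Sketch: train and cache a sub-word (BPE) tokenizer for one language.
# Assumes the Hugging Face `tokenizers` package; paths, vocab_size, and
# special tokens are illustrative, not the repo's actual settings.
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.trainers import BpeTrainer

tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = Whitespace()
trainer = BpeTrainer(vocab_size=32000,
                     special_tokens=["[UNK]", "[PAD]", "[SOS]", "[EOS]"])
tokenizer.train(files=["dataset/train.en"], trainer=trainer)  # hypothetical path
tokenizer.save("vocab/en_tokenizer.json")  # cached alongside vocab.py

# Later runs load the cached file instead of retraining:
tokenizer = Tokenizer.from_file("vocab/en_tokenizer.json")
```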
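Finally, a sketch of scoring cached predictions with BLEU, one of the automatic metrics a module like `model_eval.py` might compute. It assumes the `sacrebleu` package as a stand-in for whichever metric implementations the repo actually uses; the sentences and variable names are illustrative.

```python
# Sketch: score model output against references with corpus-level BLEU.
# Assumes the `sacrebleu` package; model_eval.py may compute its metrics
# differently. The sentences below are illustrative only.
import sacrebleu

hypotheses = ["Der Hund lief durch den Park ."]      # one model output
references = [["Der Hund rannte durch den Park ."]]  # one list per reference set

bleu = sacrebleu.corpus_bleu(hypotheses, references)
print(f"BLEU: {bleu.score:.2f}")
```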
This project leverages materials from Stanford University's Natural Language Processing with Deep Learning (XCS224N) course, with many modifications.