This project investigates the implementation of a Neural Machine Translation (NMT) system using a Sequence-to-Sequence (Seq2Seq) framework. Developed for the Imperial College London TensorFlow 2 Professional Certification, the study focuses on mapping latent semantic representations between English and German.
The project evolved from a 20,000-sample prototype into a production-ready pipeline trained on a corpus of over 200,000 sentence pairs, achieving measurable improvements in translation fluency and structural alignment.
The system utilizes a dual-recurrent framework optimized for high-dimensional semantic mapping.
- Transfer Learning: Utilizes a pre-trained NNLM (Neural-Net Language Model) embedding from TensorFlow Hub to project English tokens into a 128-dimensional latent space.
- Latent Bottleneck: Employs a 512-unit LSTM layer to compress the entire source sequence into a final hidden ($h$) and cell ($c$) state (the "Context Vector").
- Orthogonal Initialization: Used for all LSTM units to stabilize gradient flow across long sequences and prevent vanishing gradients.
- Vocabulary Capping: The German vocabulary was strategically capped at 15,000 tokens to prioritize high-frequency linguistic structures and optimize GPU VRAM usage.
- Teacher Forcing: Utilized during training to accelerate convergence by feeding ground-truth tokens back into the recurrent engine.
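The architecture described above can be sketched with the Keras Functional API. This is a minimal illustration, not the project's exact code: the TensorFlow Hub NNLM layer is replaced by a plain `Embedding` so the sketch stays self-contained, and the English vocabulary size (20,000) is an assumption; the 128-dimensional embedding, 512-unit LSTM, 15,000-token German vocabulary, and sequence length of 13 follow the text.

```python
import tensorflow as tf
from tensorflow.keras import layers

VOCAB_DE = 15_000   # German vocabulary cap (from the text)
EMBED_DIM = 128     # latent-space dimensionality (from the text)
UNITS = 512         # LSTM units forming the latent bottleneck
SEQ_LEN = 13        # fixed padded sequence length (from the text)

# Encoder: in the real pipeline a TF-Hub NNLM embedding projects English
# tokens into 128-d; a plain Embedding stands in here for self-containment.
enc_inputs = layers.Input(shape=(SEQ_LEN,), name="english_tokens")
enc_emb = layers.Embedding(20_000, EMBED_DIM)(enc_inputs)  # 20k EN vocab is assumed
_, state_h, state_c = layers.LSTM(
    UNITS, return_state=True,
    kernel_initializer="orthogonal",
    recurrent_initializer="orthogonal")(enc_emb)
# (state_h, state_c) together form the "Context Vector".

# Decoder with teacher forcing: the ground-truth German sequence, shifted
# right by one position, is fed in and conditioned on the context vector.
dec_inputs = layers.Input(shape=(SEQ_LEN,), name="german_tokens_shifted")
dec_emb = layers.Embedding(VOCAB_DE, EMBED_DIM)(dec_inputs)
dec_seq = layers.LSTM(
    UNITS, return_sequences=True,
    kernel_initializer="orthogonal",
    recurrent_initializer="orthogonal")(dec_emb, initial_state=[state_h, state_c])
logits = layers.Dense(VOCAB_DE, name="german_logits")(dec_seq)

model = tf.keras.Model([enc_inputs, dec_inputs], logits)
```

The decoder emits logits rather than softmax probabilities here, which pairs naturally with a `from_logits=True` cross-entropy loss.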
```mermaid
graph LR
subgraph Encoder
A[English Input] --> B(Embedding)
B --> C[LSTM Layer]
end
C -->|Hidden + Cell States| D{Context Vector}
subgraph Decoder
D --> E[LSTM Layer]
E --> F(Dense Softmax)
F --> G[German Output]
end
G -.->|Feedback Loop| E
```
Figure: Sequence-to-Sequence framework with Latent Bottleneck and Recursive Inference.
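The dotted feedback edge in the diagram is the inference-time recursion: at each step the decoder's own argmax prediction becomes its next input. A minimal NumPy sketch of that loop, where `greedy_decode`, `step_fn`, and the special-token ids are hypothetical names, not the project's actual API:

```python
import numpy as np

START_ID, END_ID, MAX_LEN = 1, 2, 13  # assumed special-token ids; 13 is the padded length

def greedy_decode(step_fn, context, max_len=MAX_LEN):
    """Recursive inference: each predicted token is fed back as the next input.

    `step_fn(token_id, state) -> (logits, state)` is a hypothetical single-step
    decoder; in the real model it would wrap the decoder LSTM + Dense softmax.
    """
    token, state, output = START_ID, context, []
    for _ in range(max_len):
        logits, state = step_fn(token, state)
        token = int(np.argmax(logits))   # pick the most probable German token
        if token == END_ID:
            break
        output.append(token)
    return output

# Toy step function: deterministically emits tokens 5, 6, then the end token.
script = {START_ID: 5, 5: 6, 6: END_ID}
def toy_step(token, state):
    logits = np.zeros(10)
    logits[script[token]] = 1.0
    return logits, state

print(greedy_decode(toy_step, context=None))  # → [5, 6]
```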
To transition from a 20,000 to a 200,000+ sample corpus, the following optimizations were critical:
- Streaming Pipeline: Implemented a `tf.data.Dataset` architecture with asynchronous prefetching (`tf.data.AUTOTUNE`) to ensure zero GPU starvation.
- Vectorized Padding: Dynamic truncation and padding to a fixed length of 13 ensured tensor compatibility with the Functional API.
- Masked Loss: Applied a custom Masked Sparse Categorical Cross-Entropy loss to ensure that padding tokens did not dilute the gradient signal.
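The masked loss and the prefetching pipeline above can be sketched as follows. This is an illustrative version, assuming padding token id 0 and the batch size of 64 from the benchmark table:

```python
import tensorflow as tf

# Masked Sparse Categorical Cross-Entropy: positions where the target is the
# padding token (assumed id 0) contribute nothing to the gradient signal.
def masked_sparse_ce(y_true, y_pred):
    per_token = tf.keras.losses.sparse_categorical_crossentropy(
        y_true, y_pred, from_logits=True)
    mask = tf.cast(tf.not_equal(y_true, 0), per_token.dtype)
    return tf.reduce_sum(per_token * mask) / tf.reduce_sum(mask)

# Streaming pipeline sketch: shuffle, batch at 64, and prefetch asynchronously
# so the GPU never starves while the CPU prepares the next batch.
def make_dataset(src, tgt_in, tgt_out, batch_size=64):
    ds = tf.data.Dataset.from_tensor_slices(((src, tgt_in), tgt_out))
    return (ds.shuffle(buffer_size=10_000)
              .batch(batch_size)
              .prefetch(tf.data.AUTOTUNE))
```

Averaging by the mask sum (rather than the sequence length) is the key detail: a mostly-padded batch still yields a correctly scaled per-token loss.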
Benchmark results obtained on a strictly isolated 20,000-sentence holdout set:
| Metric | Result | Interpretation |
|---|---|---|
| BLEU Score | 17.32 | Strong baseline performance with significant n-gram overlap. |
| Validation Perplexity | 5.35 | High confidence in word prediction (branching factor < 6). |
| Training Samples | 160,000 | Robust exposure to bilingual syntax patterns. |
| Batch Size | 64 | Balanced gradient stability with memory efficiency. |
Training curve: Masked Sparse Categorical Cross-Entropy over 10 epochs. The convergence of the validation loss indicates robust generalization.
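Perplexity and cross-entropy are two views of the same quantity: perplexity is the exponentiated per-token loss, so the reported validation perplexity of 5.35 corresponds to a masked cross-entropy of about 1.68 nats per non-padding token. A quick check:

```python
import math

def perplexity(cross_entropy_nats: float) -> float:
    """Perplexity is the exponentiated per-token cross-entropy (in nats)."""
    return math.exp(cross_entropy_nats)

# The reported validation perplexity of 5.35 implies a per-token
# cross-entropy of ln(5.35) ≈ 1.677 nats, i.e. an effective
# branching factor of fewer than 6 candidate words per position.
print(round(perplexity(1.677), 2))  # → 5.35
```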
The notebook is configured for Automated Pipeline Integration. It automatically fetches the English-German corpus (provided by Imperial College) directly from Google Drive using the gdown utility. The dataset is based on the language dataset from ManyThings.org/anki, which consists of over 200,000 sentence pairs.
The easiest way to run the study is via Google Colab.
Recommended for users with NVIDIA GPUs to leverage cuDNN acceleration.
```bash
git clone https://github.com/fvalerii/nmt-seq2seq-translation.git
```

It is recommended to use an environment with Python 3.12.8:

```bash
conda env create -f environment.yml
conda activate nmt_research
pip install -r requirements.txt
```

Open the notebook in VS Code or Jupyter: `notebooks/nmt_english_german_seq2seq.ipynb`
- Frameworks: TensorFlow 2.x, Keras, TensorFlow Hub
- Libraries: NumPy, Matplotlib, NLTK (BLEU), scikit-learn
- Execution: Optimized for NVIDIA GPU acceleration
This project serves as the capstone research study for the "TensorFlow 2 for Deep Learning" Professional Certification by Imperial College London. It demonstrates mastery of custom training loops, model subclassing, and complex NLP data engineering.
Note: To ensure reproducibility, global random seeds were set for Python, NumPy, and TensorFlow. Minor variances (<0.1%) may still occur due to non-deterministic CUDA kernels when switching between GPU architectures.