This project, developed for the NLP course (AIM 829) at the International Institute of Information Technology, Bangalore, implements and compares two deep learning architectures for English-to-Hindi machine translation and integrates the superior model into a full-fledged audio-to-audio translation pipeline.
Many non-native English speakers struggle to follow spoken English educational content, particularly when it is delivered in an unfamiliar accent. This project aims to bridge that gap by converting English-spoken video content into Hindi, making it more accessible.
The final system takes an English video, translates the speech to Hindi, and generates a new video with synchronized Hindi audio.
This repository contains two main components:
- Two NMT Models: An LSTM-based Seq2Seq model and a more advanced Transformer model.
- An End-to-End Pipeline: A comprehensive system that translates spoken English from a video file into spoken Hindi.
The final pipeline operates in the following stages:
- Audio Extraction & Speech Recognition: Extracts the audio from the input video and uses Google's Speech Recognition API to convert English speech to text (a code sketch of this and the synthesis stage follows this list).
- Punctuation Restoration: Applies the Silero punctuation-restoration model to the transcribed text, which is crucial for accurate translation.
- Neural Machine Translation (NMT): Uses our trained Transformer model to translate the English sentences into Hindi.
- Text-to-Speech (TTS) Synthesis: Employs the high-quality facebook/mms-tts-hin model to generate natural-sounding Hindi audio from the translated text.
- Audio Synchronization & Video Generation: Aligns the newly generated Hindi audio segments with the original video's timing (preserving silences) and creates the final output video.
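For illustration, here is a minimal sketch of the extraction, recognition, and synthesis stages. It is not the repository's main_pipeline.py: it assumes ffmpeg on the PATH plus the SpeechRecognition, transformers, torch, and scipy packages, and all file names are placeholders.

```python
import subprocess
import torch
import scipy.io.wavfile
import speech_recognition as sr
from transformers import VitsModel, AutoTokenizer

# Stage 1: extract mono 16 kHz audio from the input video with ffmpeg.
subprocess.run(["ffmpeg", "-y", "-i", "input.mp4", "-vn", "-ac", "1",
                "-ar", "16000", "audio.wav"], check=True)

# Stage 1 (continued): transcribe the English speech with Google's API.
recognizer = sr.Recognizer()
with sr.AudioFile("audio.wav") as source:
    english_text = recognizer.recognize_google(recognizer.record(source))

# Stages 2-3 (punctuation restoration and NMT) would run here.
hindi_text = english_text  # placeholder for the translated sentence

# Stage 4: synthesize Hindi speech with the facebook/mms-tts-hin VITS model.
tokenizer = AutoTokenizer.from_pretrained("facebook/mms-tts-hin")
tts = VitsModel.from_pretrained("facebook/mms-tts-hin")
inputs = tokenizer(hindi_text, return_tensors="pt")
with torch.no_grad():
    waveform = tts(**inputs).waveform[0]
scipy.io.wavfile.write("hindi.wav", rate=tts.config.sampling_rate,
                       data=waveform.numpy())
```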
We evaluated both models using the BLEU score, a standard metric for translation quality. The table below compares them directly: the Transformer delivers a large improvement over the Seq2Seq baseline despite being trained for fewer epochs (a sketch of the score computation follows the table).
| Metric | Seq2Seq (LSTM) Model | Transformer Model (Final) |
|---|---|---|
| BLEU-1 Score | 0.1512 (15.1%) | 0.6174 (61.7%) |
| BLEU-4 Score | 0.0181 (1.8%) | 0.3402 (34.0%) |
| Architecture | RNN-based Encoder-Decoder | See Architecture Details Below |
| Training Time | 4+ hours | ~7.5 hours (on a P100 GPU) |
| Epochs Trained | 15 (with Early Stopping) | 10 |
| Translation Quality | Basic, often incorrect | Coherent & Contextually Aware |
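For reproducibility, per-sentence BLEU-1 and BLEU-4 can be computed with NLTK as sketched below. This is illustrative only; the tokenization here is assumed, and the repository may compute the scores differently (e.g., at corpus level).

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = "मैं देख सकता हूँ".split()    # tokenized reference translation
hypothesis = "मैं देख सकता हूँ".split()   # tokenized model output

smooth = SmoothingFunction().method1     # avoids zero scores on short sentences
bleu1 = sentence_bleu([reference], hypothesis,
                      weights=(1, 0, 0, 0), smoothing_function=smooth)
bleu4 = sentence_bleu([reference], hypothesis,
                      weights=(0.25, 0.25, 0.25, 0.25), smoothing_function=smooth)
print(f"BLEU-1: {bleu1:.4f}  BLEU-4: {bleu4:.4f}")
```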
Our final model is based on the architecture from "Attention Is All You Need" by Vaswani et al., implemented from scratch with the following key hyperparameters (a code sketch follows the list):
- Encoder/Decoder Stacks: The model consists of 3 identical layers in both the encoder and decoder stacks.
- Multi-Head Attention: Each self-attention and cross-attention mechanism utilizes 8 parallel attention heads, allowing the model to jointly attend to information from different representation subspaces.
- Embedding Dimension (d_model): All embedding layers and sub-layer outputs produce vectors of dimension 512.
- Feed-Forward Networks: The position-wise feed-forward network within each layer consists of two linear transformations with an inner dimensionality of 2048.
- Regularization: A Dropout rate of 0.1 is applied to the output of each sub-layer before it is added to the sub-layer input (residual connection).
- Vocabulary Size: The source (English) and target (Hindi) vocabularies are capped at approximately 10,000 tokens each.
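The sketch below expresses these hyperparameters with PyTorch's built-in torch.nn.Transformer. It is illustrative only: the repository implements the architecture from scratch, and the class and variable names here are hypothetical.

```python
import torch
import torch.nn as nn

# Hypothetical sketch: the repository implements the Transformer from scratch;
# torch.nn.Transformer is used here only to illustrate the hyperparameters.
SRC_VOCAB = 10_000   # approximate English vocabulary size
TGT_VOCAB = 10_000   # approximate Hindi vocabulary size
D_MODEL = 512        # embedding / sub-layer output dimension

class TranslationTransformer(nn.Module):
    def __init__(self):
        super().__init__()
        self.src_embed = nn.Embedding(SRC_VOCAB, D_MODEL)
        self.tgt_embed = nn.Embedding(TGT_VOCAB, D_MODEL)
        self.transformer = nn.Transformer(
            d_model=D_MODEL,
            nhead=8,                # parallel attention heads
            num_encoder_layers=3,   # encoder stack depth
            num_decoder_layers=3,   # decoder stack depth
            dim_feedforward=2048,   # position-wise FFN inner dimension
            dropout=0.1,            # applied before each residual connection
            batch_first=True,
        )
        self.out = nn.Linear(D_MODEL, TGT_VOCAB)

    def forward(self, src_ids, tgt_ids):
        # Positional encodings and causal/padding masks are omitted for
        # brevity; the real model adds both.
        h = self.transformer(self.src_embed(src_ids), self.tgt_embed(tgt_ids))
        return self.out(h)
```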
The Seq2Seq (LSTM) model served as our baseline. While it learned some word-level associations, it struggled with grammar and sentence structure; a skeletal version of the architecture is sketched below the sample.
- BLEU-1 Score: 0.1512 (15.1%)
- BLEU-4 Score: 0.0181 (1.81%)
Sample Translation (Seq2Seq):
| English | Reference Hindi | Model Output (Incorrect) |
|---|---|---|
| I can see | मैं देख सकता हूँ | मैं देख सकते |
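The sample above shows a typical agreement error. For context, here is an illustrative skeleton of such an RNN encoder-decoder; the embedding and hidden sizes are assumptions, not the repository's exact configuration.

```python
import torch
import torch.nn as nn

# Illustrative skeleton of an LSTM encoder-decoder baseline; the embedding
# and hidden sizes are assumptions, not the repository's configuration.
class Seq2SeqBaseline(nn.Module):
    def __init__(self, src_vocab=10_000, tgt_vocab=10_000, emb=256, hidden=512):
        super().__init__()
        self.src_embed = nn.Embedding(src_vocab, emb)
        self.tgt_embed = nn.Embedding(tgt_vocab, emb)
        self.encoder = nn.LSTM(emb, hidden, batch_first=True)
        self.decoder = nn.LSTM(emb, hidden, batch_first=True)
        self.out = nn.Linear(hidden, tgt_vocab)

    def forward(self, src_ids, tgt_ids):
        # The encoder's final (h, c) state seeds the decoder, so the whole
        # source sentence is squeezed through a fixed-size bottleneck; one
        # reason the baseline loses grammatical agreement on longer inputs.
        _, state = self.encoder(self.src_embed(src_ids))
        dec_out, _ = self.decoder(self.tgt_embed(tgt_ids), state)
        return self.out(dec_out)
```

The Transformer removes this bottleneck by letting the decoder attend to every source position directly, which is why it handles longer, grammatically complex sentences so much better.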
The Transformer model, based on the "Attention Is All You Need" paper, significantly outperformed the baseline. It effectively captures contextual relationships, leading to vastly superior translations.
- BLEU-1 Score: 0.6174 (61.7%), indicating strong word-level agreement with the references
- BLEU-4 Score: 0.3402 (34.02%), in a range generally associated with good- to high-quality translations
Sample Translations (Transformer):
| English | Hindi Translation (Correct) |
|---|---|
| I love to eat delicious food. | मैं स्वादिष्ट खाना खाता हूँ। |
| What is your name? | आपका नाम क्या है ? |
| India is a beautiful country... | भारत समृद्ध सांस्कृतिक विरासत वाला एक सुंदर देश है । |
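Translations like these are produced token by token at inference time. Below is a hypothetical greedy-decoding loop, assuming a model with the forward(src_ids, tgt_ids) signature sketched earlier and bos_id/eos_id special-token ids from the target vocabulary; the repository's notebooks may use a different decoding strategy.

```python
import torch

# Hypothetical greedy decoder, assuming a model with the forward(src_ids,
# tgt_ids) signature sketched earlier; bos_id and eos_id are assumed ids.
@torch.no_grad()
def greedy_translate(model, src_ids, bos_id, eos_id, max_len=64):
    model.eval()
    tgt = torch.tensor([[bos_id]])               # start with the <bos> token
    for _ in range(max_len):
        logits = model(src_ids, tgt)             # (1, tgt_len, vocab)
        next_id = logits[:, -1].argmax(dim=-1)   # most probable next token
        tgt = torch.cat([tgt, next_id.unsqueeze(0)], dim=1)
        if next_id.item() == eos_id:             # stop at end-of-sentence
            break
    return tgt.squeeze(0).tolist()               # token ids, detokenized later
```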
To get a local copy up and running, follow these steps.
You will need Python 3.8+ and pip installed. The pipeline also requires ffmpeg for audio processing.
- Install ffmpeg:
```bash
# On Ubuntu/Debian
sudo apt update && sudo apt install ffmpeg

# On macOS (using Homebrew)
brew install ffmpeg
```
- Clone the repository:
```bash
git clone https://github.com/bajoriya-vaibhav/Neural-Machine-Translation-Transformer.git
cd Neural-Machine-Translation-Transformer
```
- Install Python dependencies:
```bash
pip install -r requirements.txt
```
- Download SpaCy model:
```bash
python -m spacy download en_core_web_sm
```
The pre-trained model weights are required to run the translation pipeline.
Place the downloaded model files in a `models/` directory in the project root.
The primary entry point for the audio-to-audio translation is the main pipeline script.
```bash
python main_pipeline.py --input_video_path /path/to/your/video.mp4 --output_video_path /path/to/output/video.mp4
```
Check the respective notebooks for details on how to train the models or run inference on text-only data.
The models were trained on publicly available parallel corpora:
- IIT Bombay English-Hindi Corpus: A large-scale corpus consisting of aligned sentence pairs.
- TED Talks Hindi-English Truncated Corpus: A high-quality dataset derived from translated TED talk transcripts.
- Abhinav Kumar: Seq2Seq model training, report preparation.
- Abhinav Deshpande: Seq2Seq & Transformer model definition, audio pipeline architecture.
- Vaibhav Bajoriya: Transformer model training & definition, TTS model integration.
- Shashank Devarmani: Test video preparation, speech-to-text integration, pipeline debugging.
- This project was completed as part of our coursework at the International Institute of Information Technology, Bangalore.
- We thank the creators of the open-source libraries and datasets that made this work possible.