Finetuning-Opus-MT-en-fr-WikiData

# English-French Machine Translation Project

## Overview
This project fine-tunes the Helsinki-NLP/opus-mt-en-fr model from Hugging Face for English-French translation. It covers dataset selection, extensive data filtering and preprocessing, model training, and evaluation with SacreBLEU, chrF++, and COMET.

## Dataset Selection

### Source
- **Dataset:** English-French Wiki dataset
- **Source:** [OPUS](https://opus.nlpl.eu/Wikipedia/en&fr/v1.0/Wikipedia)

### Initial Sentence Pairs
- **Total Pairs:** 818,302

## Data Filtering and Preprocessing

### Filtering Process
1. **Duplicates Removed:** 803,704 pairs retained (14,598 pairs removed)
2. **Sentences ≤ 200 Words:** 801,392 pairs retained
3. **Length Ratio ≤ 1.5:** 691,348 pairs retained
4. **Non-printable Characters Removed:** 691,222 pairs retained (126 pairs removed)
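
The four filtering steps above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code (which lives in `data_filtering.ipynb`). It assumes that duplicates are exact matches of the full pair, that the length ratio is computed on word counts (longer side over shorter side), and that step 4 strips non-printable characters and drops pairs left empty. The names `filter_pairs` and `strip_nonprintable` are hypothetical.

```python
import unicodedata

MAX_WORDS = 200   # limit stated in the filtering list
MAX_RATIO = 1.5   # limit stated in the filtering list


def strip_nonprintable(text):
    # Drop characters in Unicode "C*" categories (control, format, unassigned, ...).
    # Note: this also removes tabs and newlines, which are control characters.
    return "".join(ch for ch in text if not unicodedata.category(ch).startswith("C"))


def filter_pairs(pairs):
    """Apply the four filters in order to (english, french) tuples."""
    seen = set()
    kept = []
    for en, fr in pairs:
        # 1. Exact-duplicate removal (assumption: the whole pair must match).
        if (en, fr) in seen:
            continue
        seen.add((en, fr))
        # 2. Word-count cap on both sides.
        n_en, n_fr = len(en.split()), len(fr.split())
        if n_en == 0 or n_fr == 0 or n_en > MAX_WORDS or n_fr > MAX_WORDS:
            continue
        # 3. Length ratio, longer side over shorter side (assumption: in words).
        if max(n_en, n_fr) / min(n_en, n_fr) > MAX_RATIO:
            continue
        # 4. Strip non-printable characters; drop pairs that end up empty.
        en = strip_nonprintable(en).strip()
        fr = strip_nonprintable(fr).strip()
        if not en or not fr:
            continue
        kept.append((en, fr))
    return kept
```

Applying the filters in this order mirrors the retained counts reported above, where each step operates on the output of the previous one.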

### Preprocessing Pipeline
- **Character Cleaning:** Removed non-printable/control characters from both source and target sentences.
- **Normalization:** Applied NFKC normalization and reduced multiple spaces to single spaces.
- **Symbol Removal:** Eliminated unwanted symbols while retaining essential punctuation using regex.
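
A minimal sketch of such a cleaning function is shown below. The source does not list which symbols were kept, so the punctuation whitelist in `_UNWANTED` is an assumption chosen for illustration, and `clean_sentence` is a hypothetical name.

```python
import re
import unicodedata

# Assumed whitelist: word characters, whitespace, and common punctuation.
# Everything else is treated as an "unwanted symbol".
_UNWANTED = re.compile(r"[^\w\s.,;:!?'\"()%€$«»\-]")
_MULTISPACE = re.compile(r"\s+")


def clean_sentence(text):
    # NFKC folds compatibility forms (ligatures, full-width letters,
    # non-breaking spaces) into their canonical equivalents.
    text = unicodedata.normalize("NFKC", text)
    # Remove control/format characters but keep whitespace so words stay separated.
    text = "".join(
        ch for ch in text
        if ch.isspace() or not unicodedata.category(ch).startswith("C")
    )
    # Remove symbols outside the assumed whitelist.
    text = _UNWANTED.sub("", text)
    # Collapse runs of whitespace to a single space.
    return _MULTISPACE.sub(" ", text).strip()
```

The same function would be applied to both the source and the target side of each pair.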

## COMET Scoring and Sampling

### Scoring
- **Method:** Computed reference-free COMET quality-estimation scores (Unbabel/wmt20-comet-qe-da) for 50% of the filtered data (345,611 of 691,222 pairs) to support further filtering.
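
As a rough sketch, scoring half of the pairs with this model might look like the code below. It assumes the `unbabel-comet` package (2.x, where `predict` returns an object with a `scores` attribute) and a random 50% subset; the source does not say how the subset was chosen. The function names, seed, and batch size are illustrative.

```python
import random


def sample_half(pairs, seed=0):
    """Pick a random 50% subset (the seed is an assumption for reproducibility)."""
    rng = random.Random(seed)
    return rng.sample(pairs, len(pairs) // 2)


def to_comet_inputs(pairs):
    # Reference-free QE models only need a source ("src") and a hypothesis ("mt").
    return [{"src": en, "mt": fr} for en, fr in pairs]


def score_with_comet(pairs, batch_size=32, gpus=0):
    # Imported lazily so the helpers above work without unbabel-comet installed.
    from comet import download_model, load_from_checkpoint

    model = load_from_checkpoint(download_model("Unbabel/wmt20-comet-qe-da"))
    output = model.predict(to_comet_inputs(pairs), batch_size=batch_size, gpus=gpus)
    return output.scores  # one quality score per pair
```

Because the model is reference-free, the French side of each pair is passed as the "translation" being judged against its English source.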

### Language Detection

#### Source Language Distribution
| Language | Count   |
|----------|---------|
| en       | 316,161 |
| fr       | 11,872  |
| de       | 4,039   |
| it       | 2,339   |
| pt       | 818     |
| ca       | 971     |
| nl       | 668     |
| id       | 628     |
| es       | 1,166   |
| ro       | 499     |
| tl       | 1,057   |
| af       | 481     |
| sv       | 528     |
| unknown  | 1,357   |

#### Target Language Distribution
| Language | Count   |
|----------|---------|
| fr       | 298,086 |
| en       | 14,621  |
| de       | 611     |
| it       | 540     |
| ca       | 549     |
| es       | 460     |
| pt       | 142     |
| ro       | 145     |
| nl       | 184     |
| id       | 82      |
| unknown  | 98      |
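
Distributions like the two tables above can be produced by tallying a language detector's output. The sketch below takes the detector as a parameter because the source does not name the tool that was used; `language_counts` is a hypothetical helper.

```python
from collections import Counter


def language_counts(sentences, detect):
    """Tally detected language codes; `detect` maps text to a code such as "en"."""
    counts = Counter()
    for text in sentences:
        try:
            counts[detect(text)] += 1
        except Exception:
            # Detectors commonly raise on empty or symbol-only input.
            counts["unknown"] += 1
    return counts
```

One possible detector is `detect` from the `langdetect` package, passed as `language_counts(sentences, detect)`; that choice is an assumption, not something the project states.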

### Filtered Rows and Sampling
- **Filtered Rows:** 298,086 pairs retained after removing pairs whose detected languages did not match the expected en-fr direction.
- **Sampling:** Randomly selected 130K pairs, split into:
  - **Training Set:** 100K pairs
  - **Validation Set:** 15K pairs
  - **Test Set:** 15K pairs
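
The sampling and split can be sketched as a single draw without replacement followed by slicing, which guarantees the three sets do not overlap. The seed and the name `sample_and_split` are assumptions for illustration.

```python
import random


def sample_and_split(pairs, sizes=(100_000, 15_000, 15_000), seed=42):
    """Randomly draw sum(sizes) pairs and cut them into train/validation/test."""
    total = sum(sizes)
    if total > len(pairs):
        raise ValueError("not enough pairs to sample from")
    rng = random.Random(seed)
    sample = rng.sample(pairs, total)  # sampling without replacement
    n_train, n_val, _ = sizes
    train = sample[:n_train]
    val = sample[n_train:n_train + n_val]
    test = sample[n_train + n_val:]
    return train, val, test
```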

## Model Fine-Tuning

### Model Used
- **Model:** [Helsinki-NLP/opus-mt-en-fr](https://huggingface.co/Helsinki-NLP/opus-mt-en-fr)

> **Note:** The model began to overfit after epoch 10, so the checkpoint from epoch 10 was used for evaluation.
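
One common way to pick a checkpoint under overfitting is to keep the epoch with the lowest validation loss. The source does not say which signal was used to detect overfitting, so the helper below is only an illustration of that general idea, and `best_epoch` is a hypothetical name.

```python
def best_epoch(val_losses):
    """Return the 1-based epoch with the lowest validation loss."""
    if not val_losses:
        raise ValueError("val_losses must not be empty")
    best_index = min(range(len(val_losses)), key=val_losses.__getitem__)
    return best_index + 1
```

Given per-epoch validation losses read from a training log, the returned epoch identifies the checkpoint to evaluate; later epochs with rising validation loss are the overfitting region.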

## Performance Evaluation

### Baseline
- **SacreBLEU:** 42.20
- **chrF++:** 65.99
- **COMET (Unbabel/wmt20-comet-qe-da):** 0.395

### Fine-Tuned Model
- **SacreBLEU:** 43.55
- **chrF++:** 69.16
- **COMET (Unbabel/wmt20-comet-qe-da):** 0.566
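
The surface metrics can be computed with the `sacrebleu` package, where chrF++ corresponds to chrF with `word_order=2`. The sketch below is not the project's evaluation script (see `evaluation_scripts`); the function names are hypothetical, and the two dictionaries simply restate the scores reported above.

```python
def corpus_scores(hypotheses, references):
    """Corpus-level SacreBLEU and chrF++ for one reference per hypothesis."""
    # Imported lazily so the arithmetic below works without sacrebleu installed.
    from sacrebleu.metrics import BLEU, CHRF

    bleu = BLEU().corpus_score(hypotheses, [references]).score
    chrfpp = CHRF(word_order=2).corpus_score(hypotheses, [references]).score
    return {"SacreBLEU": bleu, "chrF++": chrfpp}


# Scores as reported in this README.
BASELINE = {"SacreBLEU": 42.20, "chrF++": 65.99, "COMET": 0.395}
FINE_TUNED = {"SacreBLEU": 43.55, "chrF++": 69.16, "COMET": 0.566}


def deltas(before, after):
    """Absolute change per metric, rounded to three decimals."""
    return {name: round(after[name] - before[name], 3) for name in before}
```

Calling `deltas(BASELINE, FINE_TUNED)` gives the absolute gains from fine-tuning: +1.35 SacreBLEU, +3.17 chrF++, and +0.171 COMET.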

## Repository Contents
- `data-files/`
- `evaluation_scripts/`
- `notebooks/`
- `saved_model/checkpoint-250000/`
- `README.md`
- `Report.pdf`
- `data_filtering.ipynb`
- `graph.py`
- `train.py`
- `training_log.txt`
