A Machine Learning pipeline for automated IELTS writing scoring, predicting band scores across four rubric categories.
Academic Context: Final Exam for the Artificial Intelligence course (2025). Ranked via a Kaggle competition.
This project implements a regression pipeline to estimate IELTS band scores (0-9) based on essay content. It handles multiple target variables simultaneously.
- Preprocessing (src/preprocess.py):
  - Data Cleaning: Removes punctuation/numbers and strips excess whitespace using Regex.
  - Feature Engineering: Concatenates prompt and essay to provide full context to the model.
  - Vectorization: Uses TF-IDF (Term Frequency-Inverse Document Frequency) to convert text into numerical features (max 10,000 features).
- Model Training (src/train.py):
  - Algorithm: MultiOutput Regressor wrapping a Random Forest (100 estimators).
  - Why MultiOutput? IELTS essays are graded on 4 distinct criteria simultaneously. This architecture predicts all 4 scores in a single pass.
- Evaluation:
  - Uses Mean Squared Error (MSE) on a validation split (20%) to assess prediction accuracy.
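The steps above can be sketched end to end with scikit-learn. This is a minimal illustration, not the code in src/: the function names `clean_text` and `build_inputs`, the regex pattern, the toy data, and the choice to fit the vectorizer on the training split only are all assumptions. Only the TF-IDF cap (10,000 features), the 100-tree Random Forest wrapped in a multi-output regressor, and the 20% validation split come from the description.

```python
import re

import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.multioutput import MultiOutputRegressor


def clean_text(text: str) -> str:
    """Drop punctuation and digits, then collapse runs of whitespace."""
    # Assumed pattern: keep ASCII letters and whitespace only.
    text = re.sub(r"[^A-Za-z\s]", " ", text)
    return re.sub(r"\s+", " ", text).strip()


def build_inputs(prompts, essays):
    """Concatenate each prompt with its essay, then clean the result."""
    return [clean_text(f"{p} {e}") for p, e in zip(prompts, essays)]


# Toy data, invented purely so the sketch runs; not taken from df_train.csv.
prompts = ["Discuss the advantages of remote work."] * 10
essays = [
    f"Essay number {i}: working from home saves time, " + "and improves focus. " * (i + 1)
    for i in range(10)
]
# Ten rows, four identical score columns from 4.0 to 8.5 in half-band steps.
scores = np.tile(np.linspace(4.0, 8.5, 10)[:, None], (1, 4))

texts = build_inputs(prompts, essays)
X_train, X_val, y_train, y_val = train_test_split(
    texts, scores, test_size=0.2, random_state=42
)

# Fit TF-IDF on the training split only, then reuse it for validation.
vectorizer = TfidfVectorizer(max_features=10_000)
X_train_vec = vectorizer.fit_transform(X_train)
X_val_vec = vectorizer.transform(X_val)

# One Random Forest is fitted per target column behind a single interface.
model = MultiOutputRegressor(RandomForestRegressor(n_estimators=100, random_state=42))
model.fit(X_train_vec, y_train)

val_pred = model.predict(X_val_vec)  # shape: (n_validation_rows, 4)
val_mse = mean_squared_error(y_val, val_pred)  # averaged over the four columns
print(f"Validation MSE: {val_mse:.3f}")
```

`MultiOutputRegressor` trains an independent estimator for each target, so "a single pass" here refers to one `fit` and one `predict` call rather than one shared model.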
├── data/
│ ├── df_train.csv # Training data (Prompt + Essay + 4 Scores)
│ └── df_test.csv # Test data (Prompt + Essay only)
├── notebooks/
│ └── eda.ipynb # Exploratory Data Analysis
├── src/
│ ├── preprocess.py # Cleaning and TF-IDF vectorization logic
│ ├── train.py # Random Forest training & evaluation functions
│ └── predict.py # Submission file generation
├── outputs/
│ └── submission.csv # Generated predictions
├── main.py # Pipeline orchestrator
└── requirements.txt # Dependencies
- Python 3.8+
- Packages: pandas, scikit-learn
git clone https://github.com/noecrn/IELTS-Score-Predictor-ML.git
cd IELTS-Score-Predictor-ML
pip install -r requirements.txt
Run the complete training and prediction pipeline:
python main.py
This script will:
- Load and clean df_train.csv.
- Train the Multi-Output Random Forest model.
- Print the Validation MSE.
- Generate score predictions for df_test.csv in outputs/submission.csv.
The model predicts scores for the official IELTS writing rubric:
- Task Achievement
- Coherence and Cohesion
- Lexical Resource
- Grammatical Range
Metric: Mean Squared Error (MSE).
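As a worked example of the metric, the sketch below computes MSE per rubric category and then averages the four values. The helper name `mse_per_category` and the scores are made up for illustration, and uniform averaging across categories is an assumption (it matches scikit-learn's default for `mean_squared_error`, but the competition's exact aggregation is not stated here).

```python
RUBRIC = [
    "Task Achievement",
    "Coherence and Cohesion",
    "Lexical Resource",
    "Grammatical Range",
]


def mse_per_category(y_true, y_pred):
    """Return one MSE value per rubric column."""
    n = len(y_true)
    return [
        sum((t[j] - p[j]) ** 2 for t, p in zip(y_true, y_pred)) / n
        for j in range(len(RUBRIC))
    ]


# Two hypothetical essays: true band scores and predicted band scores.
y_true = [[6.0, 6.5, 7.0, 6.0], [5.0, 5.5, 5.0, 6.0]]
y_pred = [[6.5, 6.5, 6.0, 6.0], [5.5, 5.0, 5.0, 7.0]]

per_category = mse_per_category(y_true, y_pred)
overall = sum(per_category) / len(per_category)

for name, value in zip(RUBRIC, per_category):
    print(f"{name}: {value}")
print(f"Overall MSE: {overall}")  # 0.34375
```

Because errors are squared, a prediction that is off by a full band (1.0) contributes four times as much as one that is off by half a band (0.25).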