Automated IELTS scoring pipeline using a Multi-Output Random Forest regressor and TF-IDF vectorization to predict 4 rubric categories simultaneously.

noecrn/IELTS-Score-Predictor-ML

IELTS Score Predictor ML

A machine learning pipeline for automated IELTS writing scoring that predicts band scores across four rubric categories.

Academic Context: Final exam project for the Artificial Intelligence course (2025). Submissions were ranked via a Kaggle competition.


🏗 Architecture & Methodology

This project implements a regression pipeline that estimates IELTS band scores (0-9) from the essay prompt and the essay text. It predicts all four target variables at once.

Key Steps:

  1. Preprocessing (src/preprocess.py):
    • Data Cleaning: Removes punctuation and numbers and strips excess whitespace using regular expressions.
    • Feature Engineering: Concatenates prompt and essay to provide full context to the model.
    • Vectorization: Uses TF-IDF (Term Frequency-Inverse Document Frequency) to convert text into numerical features (max 10,000 features).
  2. Model Training (src/train.py):
    • Algorithm: A multi-output regressor wrapping a Random Forest (100 estimators).
    • Why MultiOutput? Each IELTS essay is graded on 4 distinct criteria. Wrapping the regressor this way returns all 4 scores from a single prediction call.
  3. Evaluation:
    • Computes Mean Squared Error (MSE) on a held-out validation split (20%) to assess prediction accuracy.
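
The key steps above can be sketched roughly as follows. This is a minimal illustration, not the code in src/: the function names (`clean_text`, `build_inputs`, `train_and_validate`), the exact regular expression, the `seed` parameter, and the choice to fit TF-IDF on the training split only are all assumptions. The sketch also assumes the wrapper is scikit-learn's `MultiOutputRegressor`, which the README does not state explicitly.

```python
import re

import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.multioutput import MultiOutputRegressor


def clean_text(text):
    """Replace punctuation and digits with spaces, then collapse whitespace.

    Assumption: only ASCII letters are kept; the real pattern may differ.
    """
    text = re.sub(r"[^A-Za-z\s]", " ", text)
    return re.sub(r"\s+", " ", text).strip()


def build_inputs(prompts, essays):
    """Concatenate each prompt with its essay and clean the result."""
    return [clean_text(f"{p} {e}") for p, e in zip(prompts, essays)]


def train_and_validate(texts, scores, seed=0):
    """Fit TF-IDF and a multi-output random forest.

    Returns (vectorizer, model, validation MSE) using a 20% validation split.
    """
    scores = np.asarray(scores, dtype=float)  # shape: (n_essays, 4)
    x_train, x_val, y_train, y_val = train_test_split(
        texts, scores, test_size=0.2, random_state=seed
    )
    # Hypothetical choice: fit the vocabulary on the training split only.
    vectorizer = TfidfVectorizer(max_features=10_000)
    model = MultiOutputRegressor(
        RandomForestRegressor(n_estimators=100, random_state=seed)
    )
    model.fit(vectorizer.fit_transform(x_train), y_train)
    val_pred = model.predict(vectorizer.transform(x_val))
    return vectorizer, model, mean_squared_error(y_val, val_pred)
```

Note that `MultiOutputRegressor` fits one independent forest per target column, so the four scores are produced together at prediction time but are not modeled jointly.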

📂 Project Structure

├── data/
│   ├── df_train.csv       # Training data (Prompt + Essay + 4 Scores)
│   └── df_test.csv        # Test data (Prompt + Essay only)
├── notebooks/
│   └── eda.ipynb          # Exploratory Data Analysis
├── src/
│   ├── preprocess.py      # Cleaning and TF-IDF vectorization logic
│   ├── train.py           # Random Forest training & evaluation functions
│   └── predict.py         # Submission file generation
├── outputs/
│   └── submission.csv     # Generated predictions
├── main.py                # Pipeline orchestrator
└── requirements.txt       # Dependencies

🚀 Getting Started

Prerequisites

  • Python 3.8+
  • Packages: pandas, scikit-learn

Installation

git clone https://github.com/noecrn/IELTS-Score-Predictor-ML.git
cd IELTS-Score-Predictor-ML
pip install -r requirements.txt

Usage

Run the complete training and prediction pipeline:

python main.py

This script will:

  1. Load and clean df_train.csv.
  2. Train the Multi-Output Random Forest model.
  3. Print the Validation MSE.
  4. Generate score predictions for df_test.csv and write them to outputs/submission.csv.
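
As an illustration of the last step, a submission table could be assembled along these lines. The column names, the `id` column, and both function names are hypothetical: this README does not give the actual CSV headers or the format the Kaggle competition expects.

```python
import pandas as pd

# Hypothetical column names; the real headers are not given in this README.
TARGET_COLUMNS = [
    "task_achievement",
    "coherence_cohesion",
    "lexical_resource",
    "grammatical_range",
]


def build_submission(ids, predictions, target_columns=TARGET_COLUMNS):
    """Return one row per test essay: an id column plus one column per score."""
    submission = pd.DataFrame(predictions, columns=list(target_columns))
    submission.insert(0, "id", list(ids))
    return submission


def write_submission(submission, path="outputs/submission.csv"):
    """Write the table to CSV without the pandas index column."""
    submission.to_csv(path, index=False)
```

Here `predictions` would be the array with one row per essay and four columns returned by the trained model for the test essays.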

📊 Targets & Metrics

The model predicts scores for the official IELTS writing rubric:

  • Task Achievement
  • Coherence and Cohesion
  • Lexical Resource
  • Grammatical Range and Accuracy

Metric: Mean Squared Error (MSE).
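
For reference, here is a small dependency-free sketch of how MSE can be computed over four targets. It assumes the per-target errors are averaged with equal weight, which matches the default behaviour of scikit-learn's `mean_squared_error` for multi-output data. Whether the competition leaderboard uses the same weighting is not stated here. The function names and the toy values are invented for illustration.

```python
def mse_per_target(y_true, y_pred):
    """Return the mean squared error of each target column."""
    n_rows = len(y_true)
    n_targets = len(y_true[0])
    return [
        sum((y_pred[i][j] - y_true[i][j]) ** 2 for i in range(n_rows)) / n_rows
        for j in range(n_targets)
    ]


def overall_mse(y_true, y_pred):
    """Average the per-target MSEs with equal weight (assumed weighting)."""
    per_target = mse_per_target(y_true, y_pred)
    return sum(per_target) / len(per_target)


# Toy values, invented for illustration only (two essays, four scores each).
y_true = [[6.0, 6.5, 7.0, 6.0], [5.0, 5.5, 5.0, 5.5]]
y_pred = [[6.5, 6.5, 6.0, 6.0], [5.0, 6.5, 5.0, 5.0]]

print(mse_per_target(y_true, y_pred))  # [0.125, 0.5, 0.5, 0.125]
print(overall_mse(y_true, y_pred))     # 0.3125
```

Because errors are squared, a prediction that is off by a full band costs four times as much as one that is off by half a band.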


👥 Authors

  • Noé Cornu - Engineering Student @ EPITA - GitHub | LinkedIn
  • Baptiste Rio - Engineering Student @ EPITA
