This project extends Text Classifier v1 with a stronger XGBoost model and a larger dataset. It explores linguistic data science and syntactic complexity modeling.
- Upgraded from Random Forest to XGBoost
- Larger training dataset (300 → 1,000 samples)
- More stable predictions and improved feature interpretability
The classifier uses L2SCA indices computed with TAASSC.
Predictions are supported by SHAP contribution plots, showing how each feature influences the outcome toward AI or SLW.
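To make the SHAP plots easier to read, here is a minimal sketch of the additive idea behind them: a base value plus the per-feature contributions gives a log-odds score, and the sign of each contribution shows which class it pushes toward. It is not this project's code. The function name `explain`, the choice of AI as the positive class, and the numbers are illustrative assumptions. The feature names are example L2SCA index labels and may differ from the columns TAASSC produces.

```python
import math


def explain(base_value, contributions, positive_label="AI", negative_label="SLW"):
    """Summarise additive, SHAP-style contributions for a single sample.

    Assumes the contributions are in log-odds units and that a positive
    value pushes the prediction toward `positive_label`.
    """
    # Additivity: base value + all feature contributions = log-odds score.
    margin = base_value + sum(contributions.values())
    # The logistic function turns the log-odds score into a probability.
    prob_positive = 1.0 / (1.0 + math.exp(-margin))

    # Rank features by the size of their effect, largest first.
    ranked = sorted(contributions.items(), key=lambda kv: abs(kv[1]), reverse=True)
    rows = []
    for name, value in ranked:
        if value > 0:
            direction = positive_label
        elif value < 0:
            direction = negative_label
        else:
            direction = "neither"
        rows.append((name, value, direction))
    return prob_positive, rows


# Made-up numbers for illustration only; real values would come from SHAP.
base_value = 0.0
contributions = {"MLS": 0.8, "DC/C": -0.3, "CN/T": 0.5}

prob_ai, rows = explain(base_value, contributions)
for name, value, direction in rows:
    print(f"{name:>5}  {value:+.2f}  pushes toward {direction}")
print(f"P(AI) = {prob_ai:.3f}")
```

With these illustrative values the contributions sum to a log-odds score of about 1.0, so the sketch reports a probability of roughly 0.731 for the AI class. The largest positive contribution is listed first.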
The dataset used for model training consists of 1,000 writing samples (500 human, 500 AI):
- Human-written: 500 essays by second language writers (SLW), sourced from ICNALE
- AI-generated: 500 essays generated by large language models (LLMs), sourced from the LLM-generated Essay Dataset
The data were preprocessed with TAASSC.
- The `.txt` files in `txt_samples/` are included only for demonstration and learning purposes. They are not licensed for reuse, redistribution, or commercial use.
- The dataset file `X_binary.csv` is private and is not licensed for reuse, redistribution, or modification. It is shared solely for demonstration purposes and should not be used for any other purpose.