End-to-end exploration and modeling of the UCI student performance datasets (Mathematics and Portuguese). The project delivers reproducible preprocessing, side-by-side EDA on raw and engineered features, supervised models for pass/fail and grade prediction, unsupervised clustering, and auto-saved visuals and summaries under reports/.
- data/raw/: Original UCI CSVs (student-mat.csv, student-por.csv) plus metadata archives (student+performance.zip, student.txt).
- data/processed/: Precomputed feature matrices (processed_mat.csv, processed_por.csv) generated by the preprocessing pipeline.
- src/preprocess_data.py: End-to-end preprocessing script (one-hot for binary/categorical columns, standard scaling for numerics, aligned feature set across both subjects).
- notebooks/: EDA notebooks:
  - Data_Review.ipynb: IQR outlier review on raw math/Portuguese datasets with boxplots.
  - Analyze_Raw.ipynb: Raw-data distributions, full correlation heatmaps, G3 correlation tables, and KDE comparisons of key fields.
  - Analyze_Processed.ipynb: Same diagnostics on the processed feature matrices.
- Models/: Modeling notebooks (intended for notebook execution; Classification/Regression also include if __name__ == "__main__": guards if converted to scripts):
  - Classification.ipynb: Pass/fail classifiers (Logistic Regression, Random Forest, optional XGBoost), test + 5-fold CV metrics, confusion/ROC/PR plots.
  - Regression.ipynb: G3 regression with and without G1/G2, Linear/RandomForest/optional XGBoost, MAE/RMSE/R2 and CV RMSE.
  - Kmeans.ipynb: K-means (K=2-8) with silhouette selection, PCA scatter, cluster summaries, and pass-rate per cluster.
- scripts/plot_utils.py: Utility to save all open Matplotlib figures to reports/Datasets/<dir>/ (used by notebooks).
- scripts/student_merge.R: R helper to merge math/Portuguese records on shared demographics.
- reports/: Generated assets:
  - Datasets/Raw and Datasets/Processed: EDA figures (distributions, correlation heatmaps, G3 correlation bars, KDE comparisons, outlier boxplots).
  - Models/Classification, Models/Regression, Models/KMeans: Model plots plus CSV summaries.
- Raw data lives in data/raw/ (student-mat.csv, student-por.csv, semicolon-separated). Keep filenames unchanged.
- Generate processed feature matrices (binary/categorical one-hot encoded; numeric scaled) aligned across subjects:
    python -m src.preprocess_data
  Outputs: data/processed/processed_mat.csv, data/processed/processed_por.csv.
- Processed files are already checked in for convenience; regenerate them if you update the raw data.
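The preprocessing described above (one-hot encoding, standard scaling, and a feature set aligned across both subjects) can be illustrated with a minimal pandas sketch. This is not the code in src/preprocess_data.py: the function names `encode_and_scale` and `align_features`, the toy frames, and the choice to fill missing dummy columns with 0 are all assumptions made for illustration. The real CSVs would be loaded with `pd.read_csv(path, sep=";")`.

```python
import pandas as pd


def encode_and_scale(df: pd.DataFrame, target: str = "G3") -> pd.DataFrame:
    """One-hot encode non-numeric columns and z-score the numeric ones (hypothetical helper)."""
    features = df.drop(columns=[target])
    numeric_cols = features.select_dtypes(include="number").columns
    categorical_cols = features.columns.difference(numeric_cols)

    # Standard scaling: zero mean, unit (population) standard deviation.
    # A constant column would produce NaN here; the toy data avoids that case.
    numeric = features[numeric_cols].astype(float)
    scaled = (numeric - numeric.mean()) / numeric.std(ddof=0)

    # One-hot encode everything that is not numeric.
    dummies = pd.get_dummies(features[categorical_cols], dtype=float)
    return pd.concat([scaled, dummies], axis=1)


def align_features(a: pd.DataFrame, b: pd.DataFrame):
    """Give both matrices the same columns, filling absent dummies with 0 (assumption)."""
    columns = sorted(set(a.columns) | set(b.columns))
    return (
        a.reindex(columns=columns, fill_value=0.0),
        b.reindex(columns=columns, fill_value=0.0),
    )


# Tiny made-up frames standing in for student-mat.csv and student-por.csv.
mat = pd.DataFrame({
    "school": ["GP", "MS", "GP"],
    "age": [15, 17, 16],
    "Mjob": ["teacher", "other", "health"],
    "G3": [10, 8, 14],
})
por = pd.DataFrame({
    "school": ["GP", "GP", "MS"],
    "age": [16, 18, 17],
    "Mjob": ["other", "services", "other"],
    "G3": [12, 9, 11],
})

processed_mat, processed_por = align_features(encode_and_scale(mat), encode_and_scale(por))
```

After alignment both matrices share one column order, so a model or clustering step written for one subject can be applied to the other without column bookkeeping.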
- notebooks/Data_Review.ipynb: Basic schema checks and Tukey outlier inspection on raw datasets; saves boxplots to reports/Datasets/Raw/.
- notebooks/Analyze_Raw.ipynb: Distribution grids, correlation heatmaps, G3 correlation table, and KDE comparisons for key features on raw data; saves to reports/Datasets/Raw/.
- notebooks/Analyze_Processed.ipynb: Mirrors the above analyses on the engineered feature matrices; saves to reports/Datasets/Processed/.
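For readers unfamiliar with the Tukey (IQR) outlier rule mentioned above, here is a minimal sketch. The helper name `tukey_outliers`, the conventional multiplier k = 1.5, and the sample values are assumptions; the notebook may use different columns or thresholds.

```python
import pandas as pd


def tukey_outliers(values: pd.Series, k: float = 1.5) -> pd.Series:
    """Return the entries of `values` that fall outside Tukey's fences."""
    q1, q3 = values.quantile(0.25), values.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return values[(values < lower) | (values > upper)]


# Made-up absence counts; real input would be a numeric column of the raw CSVs.
absences = pd.Series([0, 2, 4, 4, 6, 8, 10, 75], name="absences")
outliers = tukey_outliers(absences)
```

With these values the quartiles are 3.5 and 8.5, so the fences sit at -4 and 16 and only the value 75 is flagged.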
scripts.plot_utils.save_all_figs(title: str, dir: str) saves every open Matplotlib figure to reports/Datasets/<dir>/ with slugged filenames. Example inside notebooks:
from scripts.plot_utils import save_all_figs
save_all_figs("Correlation Heatmap - student-mat", "Raw") # Raw analyses
save_all_figs("Correlation Heatmap - student-mat", "Processed") # Processed analyses

Ensure the target subdirectory (Raw or Processed) exists under reports/Datasets/.
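To make the behavior of such a helper concrete, the sketch below shows one way to save every open Matplotlib figure under a slugged filename. It is not the contents of scripts/plot_utils.py: the names `slugify` and `save_open_figures`, the `root` parameter, the `_<index>.png` suffix, and the automatic directory creation are assumptions made for illustration.

```python
import re
from pathlib import Path

import matplotlib

matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt


def slugify(title: str) -> str:
    """Lowercase the title and collapse non-alphanumeric runs into underscores."""
    return re.sub(r"[^a-z0-9]+", "_", title.lower()).strip("_")


def save_open_figures(title: str, subdir: str, root: str = "reports/Datasets") -> list[Path]:
    """Hypothetical stand-in for save_all_figs; the real helper may differ."""
    out_dir = Path(root) / subdir
    out_dir.mkdir(parents=True, exist_ok=True)  # assumption: create the folder if missing
    saved = []
    # plt.get_fignums() lists the numbers of all figures that are currently open.
    for index, num in enumerate(plt.get_fignums(), start=1):
        path = out_dir / f"{slugify(title)}_{index}.png"
        plt.figure(num).savefig(path, dpi=150, bbox_inches="tight")
        saved.append(path)
    return saved
```

Calling `plt.close("all")` after saving keeps figures from one notebook cell from being saved again by the next call.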
Models/Classification.ipynb
- Loads processed features + raw labels, builds binary target (G3 >= 10).
- Trains Logistic Regression, Random Forest, and XGBoost (if installed); computes test Accuracy/Precision/Recall/F1/AUC and 5-fold CV accuracy.
- Saves per-model confusion, ROC, and PR curves plus reports/Models/Classification/classification_results_summary.csv.
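The classification workflow can be sketched roughly as follows, using synthetic data in place of the processed features and raw grades. Only the G3 >= 10 threshold and the 5-fold CV come from the description above; the 80/20 stratified split, the random seeds, and the choice to show Logistic Regression alone are assumptions.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score
from sklearn.model_selection import cross_val_score, train_test_split

rng = np.random.default_rng(0)

# Synthetic stand-ins: X plays the processed feature matrix, g3 the raw final grade (0-20).
X = rng.normal(size=(200, 5))
g3 = np.clip(np.round(10 + 3 * X[:, 0] + rng.normal(scale=2, size=200)), 0, 20)
y = (g3 >= 10).astype(int)  # pass/fail target

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)
pred = model.predict(X_test)
proba = model.predict_proba(X_test)[:, 1]  # probability of the "pass" class

metrics = {
    "accuracy": accuracy_score(y_test, pred),
    "f1": f1_score(y_test, pred),
    "auc": roc_auc_score(y_test, proba),
    # Mean accuracy over 5 cross-validation folds on the full data set.
    "cv_accuracy": cross_val_score(
        LogisticRegression(max_iter=1000), X, y, cv=5, scoring="accuracy"
    ).mean(),
}
```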
Models/Regression.ipynb
- Predicts G3 with and without G1/G2 features using Linear Regression, RandomForestRegressor, and optional XGBRegressor.
- Reports MAE, RMSE, R2, and CV RMSE; writes residual and predicted-vs-true plots.
- Summary CSV: reports/Models/Regression/regression_results_summary.csv.
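The with/without G1/G2 comparison and the reported metrics can be illustrated with a small synthetic example. The data-generating process, the 80/20 split, and the hypothetical `evaluate` helper are assumptions. CV RMSE is computed here as the mean of per-fold RMSEs, which is one common convention and may not match the notebook.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import cross_val_score, train_test_split

rng = np.random.default_rng(0)
n = 300
other = rng.normal(size=(n, 3))           # stand-in for non-grade features
g1 = 10 + 3 * rng.normal(size=n)          # first-period grade
g2 = g1 + rng.normal(size=n)              # second-period grade, close to G1
g3 = g2 + 0.5 * other[:, 0] + rng.normal(size=n)


def evaluate(X: np.ndarray, y: np.ndarray) -> dict:
    """Fit a linear model and report MAE, RMSE, R2 and 5-fold CV RMSE."""
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
    pred = LinearRegression().fit(X_train, y_train).predict(X_test)
    # scikit-learn returns negated MSE for this scorer, so flip the sign first.
    cv_mse = -cross_val_score(LinearRegression(), X, y, cv=5, scoring="neg_mean_squared_error")
    return {
        "mae": mean_absolute_error(y_test, pred),
        "rmse": float(np.sqrt(mean_squared_error(y_test, pred))),
        "r2": r2_score(y_test, pred),
        "cv_rmse": float(np.sqrt(cv_mse).mean()),
    }


with_grades = evaluate(np.column_stack([other, g1, g2]), g3)
without_grades = evaluate(other, g3)
```

Because G3 is built from G2 in this toy setup, the model with G1/G2 scores far better, which mirrors why reporting both settings is informative.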
Models/Kmeans.ipynb
- Clusters processed features (excluding G3) for mat and por.
- Selects K via silhouette over K=2-8, saves PCA scatter, cluster heatmap, and CSVs for feature means and pass rates.
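Silhouette-based selection of K can be sketched as below. The synthetic blobs stand in for a processed feature matrix with G3 removed; `n_init=10` and the random seeds are assumptions and not necessarily the notebook's settings.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA
from sklearn.metrics import silhouette_score

# Synthetic stand-in for a processed feature matrix (G3 excluded).
X, _ = make_blobs(n_samples=300, centers=3, n_features=6, cluster_std=1.0, random_state=0)

# Fit K-means for each candidate K and record the mean silhouette score.
scores = {}
for k in range(2, 9):  # K = 2..8
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

best_k = max(scores, key=scores.get)  # K with the highest silhouette score
labels = KMeans(n_clusters=best_k, n_init=10, random_state=0).fit_predict(X)

# Project to two principal components for a cluster scatter plot.
coords = PCA(n_components=2).fit_transform(X)
```

Per-cluster pass rates could then be obtained by grouping the pass/fail label by `labels`, for example with a pandas `groupby`.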
- Python 3.10+ recommended.
- Core Python deps: pandas, numpy, scikit-learn, matplotlib, seaborn; optional xgboost for boosted models.
- Jupyter (or VS Code notebooks) to run .ipynb files; R (optional) for scripts/student_merge.R.
python3 -m venv .venv
source .venv/bin/activate
pip install pandas numpy scikit-learn matplotlib seaborn # + xgboost if available

# 1) Preprocess data (creates data/processed/*)
python -m src.preprocess_data
# 2) Run EDA notebooks
jupyter notebook notebooks/ # open Data_Review, Analyze_Raw, Analyze_Processed and run all cells
# 3) Run modeling notebooks
jupyter notebook Models/ # open Classification, Regression, Kmeans and run all cells
# (Optional) Merge math/Portuguese records in R
Rscript scripts/student_merge.R

Pre-generated plots and summaries are checked in under reports/ for quick reference:
- EDA outputs: reports/Datasets/Raw/ and reports/Datasets/Processed/.
- Classification metrics/plots: reports/Models/Classification/ plus classification_results_summary.csv.
- Regression diagnostics: reports/Models/Regression/ plus regression_results_summary.csv.
- K-means visuals and tables: reports/Models/KMeans/.
