AliFazelniya/Student-Performance

Student Performance Analytics

End-to-end exploration and modeling of the UCI student performance datasets (Mathematics and Portuguese). The project delivers reproducible preprocessing, side-by-side EDA on raw and engineered features, supervised models for pass/fail and grade prediction, unsupervised clustering, and auto-saved visuals and summaries under reports/.

Repository Overview

  • data/raw/: Original UCI CSVs (student-mat.csv, student-por.csv) plus metadata archives (student+performance.zip, student.txt).
  • data/processed/: Precomputed feature matrices (processed_mat.csv, processed_por.csv) generated by the preprocessing pipeline.
  • src/preprocess_data.py: End-to-end preprocessing script (one-hot for binary/categorical columns, standard scaling for numerics, aligned feature set across both subjects).
  • notebooks/: EDA notebooks:
    • Data_Review.ipynb: IQR outlier review on raw math/Portuguese datasets with boxplots.
    • Analyze_Raw.ipynb: Raw-data distributions, full correlation heatmaps, G3 correlation tables, and KDE comparisons of key fields.
    • Analyze_Processed.ipynb: Same diagnostics on the processed feature matrices.
  • Models/: Modeling notebooks, intended to be run as notebooks. Classification and Regression also include if __name__ == "__main__": guards in case they are converted to scripts:
    • Classification.ipynb: Pass/fail classifiers (Logistic Regression, Random Forest, optional XGBoost), test + 5-fold CV metrics, confusion/ROC/PR plots.
    • Regression.ipynb: G3 regression with and without G1/G2, Linear/RandomForest/optional XGBoost, MAE/RMSE/R2 and CV RMSE.
    • Kmeans.ipynb: K-means (K=2-8) with silhouette selection, PCA scatter, cluster summaries, and pass-rate per cluster.
  • scripts/plot_utils.py: Utility to save all open Matplotlib figures to reports/Datasets/<dir>/ (used by notebooks).
  • scripts/student_merge.R: R helper to merge math/Portuguese records on shared demographics.
  • reports/: Generated assets:
    • Datasets/Raw and Datasets/Processed: EDA figures (distributions, correlation heatmaps, G3 correlation bars, KDE comparisons, outlier boxplots).
    • Models/Classification, Models/Regression, Models/KMeans: Model plots plus CSV summaries.

Data & Preprocessing

  1. Raw data lives in data/raw/ (student-mat.csv, student-por.csv, semicolon-separated). Keep filenames unchanged.
  2. Generate processed feature matrices (binary/categorical one-hot encoded; numeric scaled) aligned across subjects:
    python -m src.preprocess_data
    Outputs: data/processed/processed_mat.csv, data/processed/processed_por.csv.
  3. Processed files are already checked in for convenience; regenerate them if you update the raw data.
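
The preprocessing steps described above can be illustrated with a small sketch. This is not the code in src/preprocess_data.py; the function names preprocess and align, the inline sample rows, and the choice to pass G3 through unscaled are all assumptions made for illustration. It reads semicolon-separated data, one-hot encodes non-numeric columns, standardizes numeric ones, and aligns the two subjects to a shared column set.

```python
import io

import pandas as pd

# Made-up stand-ins for student-mat.csv / student-por.csv (semicolon-separated).
MAT_CSV = "school;sex;age;Mjob;G3\nGP;F;18;teacher;6\nGP;M;17;other;15\nMS;F;16;health;11\n"
POR_CSV = "school;sex;age;Mjob;G3\nGP;F;18;services;12\nMS;M;19;other;9\nMS;F;15;teacher;14\n"


def preprocess(df, target="G3"):
    """One-hot encode non-numeric columns and standardize numeric ones (hypothetical helper)."""
    features = df.drop(columns=[target])
    numeric = list(features.select_dtypes(include="number").columns)
    categorical = [c for c in features.columns if c not in numeric]
    encoded = pd.get_dummies(features, columns=categorical, dtype=float)
    # Population standard deviation (ddof=0), the convention StandardScaler uses.
    encoded[numeric] = (encoded[numeric] - encoded[numeric].mean()) / encoded[numeric].std(ddof=0)
    encoded[target] = df[target]  # assumption: the target is kept, unscaled
    return encoded


def align(a, b):
    """Give both frames identical columns, filling absent dummy columns with 0 (hypothetical helper)."""
    columns = sorted(set(a.columns) | set(b.columns))
    return a.reindex(columns=columns, fill_value=0.0), b.reindex(columns=columns, fill_value=0.0)


mat = preprocess(pd.read_csv(io.StringIO(MAT_CSV), sep=";"))
por = preprocess(pd.read_csv(io.StringIO(POR_CSV), sep=";"))
mat, por = align(mat, por)
```

Alignment matters because a category that appears in only one subject (here Mjob=services) would otherwise produce feature matrices with different columns.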

Exploratory Notebooks

  • notebooks/Data_Review.ipynb: Basic schema checks and Tukey (IQR-based) outlier inspection on raw datasets; saves boxplots to reports/Datasets/Raw/.
  • notebooks/Analyze_Raw.ipynb: Distribution grids, correlation heatmaps, G3 correlation table, and KDE comparisons for key features on raw data; saves to reports/Datasets/Raw/.
  • notebooks/Analyze_Processed.ipynb: Mirrors the above analyses on the engineered feature matrices; saves to reports/Datasets/Processed/.
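
For readers unfamiliar with Tukey's rule, the outlier check can be sketched as follows. This is a minimal illustration, not the notebook's code: the function name tukey_outliers, the sample values, and the conventional multiplier k=1.5 are assumptions.

```python
import statistics


def tukey_outliers(values, k=1.5):
    """Return the Tukey fences and the values lying outside them (hypothetical helper)."""
    # method="inclusive" interpolates linearly between data points.
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    outliers = [v for v in values if v < lower or v > upper]
    return lower, upper, outliers


sample = [10, 11, 12, 13, 14, 15, 16, 17, 18, 40]  # made-up values
lower, upper, outliers = tukey_outliers(sample)
```

Values outside the fences are the points a boxplot draws individually beyond the whiskers.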

Plot Capture Utility

scripts.plot_utils.save_all_figs(title: str, dir: str) saves every open Matplotlib figure to reports/Datasets/<dir>/ with slugged filenames. Example inside notebooks:

from scripts.plot_utils import save_all_figs
save_all_figs("Correlation Heatmap - student-mat", "Raw")       # Raw analyses
save_all_figs("Correlation Heatmap - student-mat", "Processed") # Processed analyses

Ensure the target subdirectory (Raw or Processed) exists under reports/Datasets/.
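
How slugged filenames and the directory requirement fit together can be sketched with the standard library alone. The helpers slugify and figure_path, the index suffix, and the .png extension are assumptions for illustration; the actual naming scheme in scripts/plot_utils.py may differ. Unlike the note above, this sketch creates the target subdirectory if it is missing.

```python
import re
import tempfile
from pathlib import Path


def slugify(title):
    """Lowercase the title and collapse runs of non-alphanumerics into single hyphens."""
    return re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-")


def figure_path(title, subdir, index, root="reports/Datasets"):
    """Build an output path for one figure, creating the subdirectory if needed."""
    target = Path(root) / subdir
    target.mkdir(parents=True, exist_ok=True)
    return target / f"{slugify(title)}-{index}.png"


# Demonstrate in a temporary directory so nothing is written to the repository.
with tempfile.TemporaryDirectory() as root:
    path = figure_path("Correlation Heatmap - student-mat", "Raw", 1, root=root)
    created = path.parent.is_dir()
```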

Modeling Notebooks

  • Models/Classification.ipynb
    • Loads processed features + raw labels, builds binary target (G3 >= 10).
    • Trains Logistic Regression, Random Forest, and XGBoost (if installed); computes test Accuracy/Precision/Recall/F1/AUC and 5-fold CV accuracy.
    • Saves per-model confusion, ROC, and PR curves plus reports/Models/Classification/classification_results_summary.csv.
  • Models/Regression.ipynb
    • Predicts G3 with and without G1/G2 features using Linear Regression, RandomForestRegressor, and optional XGBRegressor.
    • Reports MAE, RMSE, R2, and CV RMSE; writes residual and predicted-vs-true plots.
    • Summary CSV: reports/Models/Regression/regression_results_summary.csv.
  • Models/Kmeans.ipynb
    • Clusters processed features (excluding G3) for mat and por.
    • Selects K via silhouette over K=2-8, saves PCA scatter, cluster heatmap, and CSVs for feature means and pass rates.
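
The silhouette-based choice of K used for clustering can be sketched with scikit-learn. This is an illustration on synthetic data, not the code in Models/Kmeans.ipynb; the function name select_k, the make_blobs data, and the n_init and random_state settings are assumptions.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic, well-separated clusters standing in for the processed features.
X, _ = make_blobs(
    n_samples=150,
    centers=[(-10, -10), (0, 10), (10, -10)],
    cluster_std=0.5,
    random_state=0,
)


def select_k(X, k_values=range(2, 9), random_state=0):
    """Fit K-means for each K and return the K with the highest silhouette score."""
    scores = {}
    for k in k_values:
        labels = KMeans(n_clusters=k, n_init=10, random_state=random_state).fit_predict(X)
        scores[k] = silhouette_score(X, labels)
    best_k = max(scores, key=scores.get)
    return best_k, scores


best_k, scores = select_k(X)
```

The silhouette score rewards tight, well-separated clusters, so it tends to peak at the K that matches the visible group structure.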

Environment & Setup

  • Python 3.10+ recommended.
  • Core Python deps: pandas, numpy, scikit-learn, matplotlib, seaborn; optional xgboost for boosted models.
  • Jupyter (or VS Code notebooks) to run .ipynb files; R (optional) for scripts/student_merge.R.

Create and activate a virtual environment, then install the dependencies:

python3 -m venv .venv
source .venv/bin/activate
pip install pandas numpy scikit-learn matplotlib seaborn  # add xgboost for the optional boosted models
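
Because xgboost is optional, code that builds the model list can check for the package before using it. The sketch below shows one way to do that; the function name available_models and the returned labels are assumptions, and the notebooks may handle the optional dependency differently.

```python
import importlib.util


def available_models():
    """List the classifier names to train, adding XGBoost only when it is installed."""
    models = ["LogisticRegression", "RandomForestClassifier"]
    # find_spec returns None when the package cannot be imported.
    if importlib.util.find_spec("xgboost") is not None:
        models.append("XGBClassifier")
    return models


models = available_models()
```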

Quickstart

# 1) Preprocess data (creates data/processed/*)
python -m src.preprocess_data

# 2) Run EDA notebooks
jupyter notebook notebooks/  # open Data_Review, Analyze_Raw, Analyze_Processed and run all cells

# 3) Run modeling notebooks
jupyter notebook Models/     # open Classification, Regression, Kmeans and run all cells

# (Optional) Merge math/Portuguese records in R
Rscript scripts/student_merge.R

Existing Reports

Pre-generated plots and summaries are checked in under reports/ for quick reference:

  • EDA outputs: reports/Datasets/Raw/ and reports/Datasets/Processed/.
  • Classification metrics/plots: reports/Models/Classification/, including classification_results_summary.csv.
  • Regression diagnostics: reports/Models/Regression/, including regression_results_summary.csv.
  • K-means visuals and tables: reports/Models/KMeans/.
