This project analyzes factors influencing student exam performance using machine learning techniques. The analysis explores the relationship between various factors (study habits, socioeconomic background, support systems) and exam scores, then builds predictive models to identify at-risk students.
- Which factors most significantly influence exam performance?
- What is the relative importance of controllable vs. uncontrollable factors?
- Can we build a predictive model to identify at-risk students early?
- Are there unexpected patterns or anomalies in student performance?
Data-Science-Student-Performance/
├── README.md # This file
├── requirements.txt # Python dependencies
├── .gitignore # Files to exclude from Git
│
├── data/
│ ├── raw/
│ │ └── StudentPerformanceFactors.csv # Original dataset
│ ├── processed/ # Cleaned data (generated)
│ └── README.md # Data description & source
│
├── notebooks/
│ └── Student_performance_factors.ipynb # Complete analysis notebook
│
├── models/ # Saved trained models (generated)
│ └── best_model.pkl
│
├── reports/
│ └── figures/ # Visualizations (generated)
│
├── results/ # Model metrics & predictions (generated)
│
└── docs/ # Additional documentation
- Source: StudentPerformanceFactors.csv
- Size: 6,607 records (6,604 after cleaning)
- Features: 20 original columns (7 numerical, 13 categorical), counting the target
- Target Variable: Exam_Score (0-100)
- Numerical: Hours_Studied, Attendance, Previous_Scores, Sleep_Hours, Tutoring_Sessions, Physical_Activity
- Categorical: Parental_Involvement, Access_to_Resources, Motivation_Level, Family_Income, Teacher_Quality, School_Type, Gender, etc.
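
As a minimal sketch, the numerical and categorical columns listed above could be separated after loading the CSV roughly as follows. The `split_feature_types` helper and the tiny stand-in frame (including its values) are illustrative assumptions, not code or data taken from the notebook.

```python
import pandas as pd

TARGET = "Exam_Score"


def split_feature_types(df: pd.DataFrame, target: str = TARGET):
    """Return (numerical, categorical) feature names, excluding the target."""
    features = df.drop(columns=[target])
    numerical = features.select_dtypes(include="number").columns.tolist()
    categorical = features.select_dtypes(exclude="number").columns.tolist()
    return numerical, categorical


# In the project the frame would come from the raw file, e.g.:
#   df = pd.read_csv("data/raw/StudentPerformanceFactors.csv")
# A small made-up frame stands in here so the sketch runs without the file.
df = pd.DataFrame({
    "Hours_Studied": [23, 19, 24],
    "Attendance": [84, 64, 98],
    "Motivation_Level": ["Low", "Low", "Medium"],
    "School_Type": ["Public", "Public", "Private"],
    "Exam_Score": [67, 61, 74],
})

numerical, categorical = split_feature_types(df)
print("Numerical:", numerical)
print("Categorical:", categorical)
```

Dropping the target before splitting keeps `Exam_Score` out of the feature lists, which matters later when the same lists are used to build model inputs.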
- Data Collection & Inspection - Load and explore dataset structure
- Data Cleaning - Handle missing values, remove anomalies
- Exploratory Data Analysis (EDA)
  - Univariate analysis with statistical tests (Shapiro-Wilk, skewness, kurtosis)
  - Bivariate analysis (correlation, scatter plots, box plots)
  - Outlier detection using IQR method
- Feature Engineering - Create 7 new features (Study_Efficiency, Support_Score, etc.)
- Model Training - Train 3 models: Linear Regression, Decision Tree, Neural Network
- Model Evaluation - Compare models using R², RMSE, MAE metrics
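
The univariate checks and the IQR outlier rule from the EDA step can be sketched as below. This is an illustration under assumptions: the helper names `describe_distribution` and `iqr_outliers`, the multiplier `k = 1.5`, and the synthetic scores are not taken from the notebook.

```python
import numpy as np
import pandas as pd
from scipy import stats


def describe_distribution(values: pd.Series) -> dict:
    """Shapiro-Wilk p-value, skewness and excess kurtosis for one column."""
    clean = values.dropna()
    _, p_value = stats.shapiro(clean)
    return {
        "shapiro_p": float(p_value),
        "skewness": float(stats.skew(clean)),
        "kurtosis": float(stats.kurtosis(clean)),  # Fisher definition: normal = 0
    }


def iqr_outliers(values: pd.Series, k: float = 1.5) -> pd.Series:
    """Boolean mask marking values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = values.quantile(0.25), values.quantile(0.75)
    iqr = q3 - q1
    return (values < q1 - k * iqr) | (values > q3 + k * iqr)


# Synthetic scores (made up for this sketch) with one deliberately extreme value.
rng = np.random.default_rng(0)
scores = pd.Series(rng.normal(67, 4, size=200).round())
scores.iloc[0] = 101

summary = describe_distribution(scores)
mask = iqr_outliers(scores)
print(summary)
print("Flagged values:", scores[mask].tolist())
```

On the full dataset, note that SciPy warns that the Shapiro-Wilk p-value may not be accurate for samples larger than 5,000, so skewness and kurtosis are useful complements there.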
- Best Model: Decision Tree (R² ≈ 0.98, RMSE ≈ 1.5 points)
- Top Predictors: Previous_Scores, Hours_Studied, Attendance, Motivation_Level, Parental_Involvement
- Controllable factors (study hours, attendance) have a stronger impact than uncontrollable factors (family income, distance from home)
- High motivation adds ~10-15 points to exam scores on average
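
One way a predictor ranking like the one above can be obtained is from the `feature_importances_` attribute of a fitted scikit-learn decision tree. The sketch below uses synthetic data with a made-up relationship, and `max_depth=5` is an assumed setting, so its ranking only illustrates the mechanism and says nothing about the real dataset.

```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeRegressor

# Synthetic features (illustrative only); column names mirror the dataset.
rng = np.random.default_rng(42)
n = 300
X = pd.DataFrame({
    "Previous_Scores": rng.integers(50, 100, n),
    "Hours_Studied": rng.integers(1, 40, n),
    "Sleep_Hours": rng.integers(4, 10, n),
})
# Made-up target: only the first two columns drive it, Sleep_Hours is noise.
y = 0.8 * X["Previous_Scores"] + 0.5 * X["Hours_Studied"] + rng.normal(0, 1, n)

tree = DecisionTreeRegressor(max_depth=5, random_state=0).fit(X, y)

# Impurity-based importances are normalized to sum to 1 across features.
importance = pd.Series(tree.feature_importances_, index=X.columns)
importance = importance.sort_values(ascending=False)
print(importance)
```

Impurity-based importances describe how much each feature reduces error inside this particular tree; they do not by themselves show a causal effect on exam scores.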
- Python 3.8 or higher
- Jupyter Notebook or JupyterLab
- Clone this repository:
  git clone <repository-url>
  cd Data-Science-Student-Performance
- Install required dependencies:
  pip install -r requirements.txt
- Launch Jupyter Notebook:
  jupyter notebook
- Open notebooks/Student_performance_factors.ipynb and run all cells
- Explore the Data: Run cells in Student_performance_factors.ipynb sequentially
- View Visualizations: All plots are generated inline in the notebook
- Model Training: Models are trained automatically when running the notebook
- Make Predictions: Use the trained models to predict exam scores for new students
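
A prediction for a new student could look roughly like the sketch below. It assumes the saved model was written with Python's `pickle` module (the notebook may use a different serializer) and that new rows carry the same columns, in the same order, that the model was trained on. The two-feature stand-in model and its training values are made up so the example runs without the project files.

```python
import pickle
import tempfile
from pathlib import Path

import pandas as pd
from sklearn.tree import DecisionTreeRegressor

# Stand-in for the notebook's training step (made-up rows and scores).
train = pd.DataFrame({
    "Hours_Studied": [5, 10, 20, 30],
    "Attendance": [60, 70, 85, 95],
})
model = DecisionTreeRegressor(random_state=0).fit(train, [58, 63, 68, 74])

# The project stores its model at models/best_model.pkl; a temporary
# directory is used here so the sketch does not touch the repository.
model_path = Path(tempfile.mkdtemp()) / "best_model.pkl"
with model_path.open("wb") as fh:
    pickle.dump(model, fh)

# Later, or in another session: load the model and score new students.
with model_path.open("rb") as fh:
    loaded = pickle.load(fh)

new_students = pd.DataFrame({"Hours_Studied": [20], "Attendance": [85]})
predicted = loaded.predict(new_students)
print(predicted)
```

If the trained model expects engineered or encoded features, the same preprocessing has to be applied to `new_students` before calling `predict`. Only unpickle files from a source you trust, since loading a pickle can execute arbitrary code.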
- Model Performance:
  - Linear Regression: R² ≈ 0.80
  - Decision Tree: R² ≈ 0.98 (Best)
  - Neural Network: R² ≈ 0.85
- Feature Importance: Previous_Scores > Hours_Studied > Attendance > Motivation_Level
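
A comparison of the three model types on R², RMSE and MAE might be set up as in the following sketch. The synthetic data, the 80/20 held-out split, the hyperparameters, and the scaling step in front of the neural network are all assumptions made for illustration; the source does not state how the notebook splits or tunes its models, and the numbers printed here will not match the figures above.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression data (illustrative only, not the student dataset).
rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(400, 3))
y = 60 + 10 * X[:, 0] + 5 * X[:, 1] + rng.normal(0, 1, 400)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

models = {
    "Linear Regression": LinearRegression(),
    "Decision Tree": DecisionTreeRegressor(max_depth=5, random_state=0),
    # Scaling the inputs usually helps MLP training converge.
    "Neural Network": make_pipeline(
        StandardScaler(),
        MLPRegressor(hidden_layer_sizes=(32,), max_iter=2000, random_state=0),
    ),
}

results = {}
for name, model in models.items():
    pred = model.fit(X_train, y_train).predict(X_test)
    results[name] = {
        "R2": float(r2_score(y_test, pred)),
        "RMSE": float(np.sqrt(mean_squared_error(y_test, pred))),
        "MAE": float(mean_absolute_error(y_test, pred)),
    }

for name, m in results.items():
    print(f"{name}: R2={m['R2']:.3f}  RMSE={m['RMSE']:.3f}  MAE={m['MAE']:.3f}")
```

Scoring every model on the same held-out rows keeps the metrics comparable and avoids rewarding a model, such as a deep decision tree, for memorizing its training data.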
- Python 3.x
- Data Analysis: pandas, numpy
- Visualization: matplotlib, seaborn
- Statistical Tests: scipy.stats
- Machine Learning: scikit-learn (LinearRegression, DecisionTreeRegressor, MLPRegressor)
✅ Complete - Ready for submission
WIA1007 Data Science Course Project
This project is for educational purposes only.