Student Performance Analysis - Data Science Project

Overview

This project analyzes the factors that influence student exam performance using machine learning techniques. It explores how study habits, socioeconomic background, and support systems relate to exam scores, then builds predictive models to identify at-risk students.

Research Questions

Which factors most significantly influence exam performance?
What is the relative importance of controllable vs. uncontrollable factors?
Can we build a predictive model to identify at-risk students early?
Are there unexpected patterns or anomalies in student performance?

Project Structure

Data-Science-Student-Performance/
├── README.md                          # This file
├── requirements.txt                   # Python dependencies
├── .gitignore                        # Files to exclude from Git
│
├── data/
│   ├── raw/
│   │   └── StudentPerformanceFactors.csv      # Original dataset
│   ├── processed/                    # Cleaned data (generated)
│   └── README.md                     # Data description & source
│
├── notebooks/
│   └── Student_performance_factors.ipynb      # Complete analysis notebook
│
├── models/                           # Saved trained models (generated)
│   └── best_model.pkl
│
├── reports/
│   └── figures/                      # Visualizations (generated)
│
├── results/                          # Model metrics & predictions (generated)
│
└── docs/                             # Additional documentation

Dataset

Source: StudentPerformanceFactors.csv
Size: 6,607 records (6,604 after cleaning)
Features: 20 original columns (7 numerical, 13 categorical); the numerical count includes the target, Exam_Score
Target Variable: Exam_Score (0-100)

Key Features:

Numerical: Hours_Studied, Attendance, Previous_Scores, Sleep_Hours, Tutoring_Sessions, Physical_Activity
Categorical: Parental_Involvement, Access_to_Resources, Motivation_Level, Family_Income, Teacher_Quality, School_Type, Gender, etc.
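
One way to check the numerical/categorical split is to group columns by dtype after loading the CSV. The sketch below assumes the file loads cleanly with pandas. The helper name `split_feature_types` and the three-row sample frame are illustrative and not taken from the notebook.

```python
import pandas as pd


def split_feature_types(df: pd.DataFrame, target: str = "Exam_Score"):
    """Return (numerical, categorical) feature names, excluding the target."""
    features = df.drop(columns=[target])
    numerical = features.select_dtypes(include="number").columns.tolist()
    categorical = features.select_dtypes(exclude="number").columns.tolist()
    return numerical, categorical


# Tiny made-up sample; the real data would come from
# pd.read_csv("data/raw/StudentPerformanceFactors.csv").
sample = pd.DataFrame(
    {
        "Hours_Studied": [23, 19, 24],
        "Attendance": [84, 64, 98],
        "Motivation_Level": ["Low", "Low", "Medium"],
        "School_Type": ["Public", "Public", "Private"],
        "Exam_Score": [67, 61, 74],
    }
)

numerical, categorical = split_feature_types(sample)
print(numerical)    # ['Hours_Studied', 'Attendance']
print(categorical)  # ['Motivation_Level', 'School_Type']
```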

Methodology

1. Data Collection & Inspection - Load and explore dataset structure
2. Data Cleaning - Handle missing values, remove anomalies
3. Exploratory Data Analysis (EDA)
   - Univariate analysis with statistical tests (Shapiro-Wilk, skewness, kurtosis)
   - Bivariate analysis (correlation, scatter plots, box plots)
   - Outlier detection using IQR method
4. Feature Engineering - Create 7 new features (Study_Efficiency, Support_Score, etc.)
5. Model Training - Train 3 models: Linear Regression, Decision Tree, Neural Network
6. Model Evaluation - Compare models using R², RMSE, MAE metrics
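
The IQR outlier step can be sketched as follows. The 1.5 multiplier is the conventional default and is an assumption here, since the notebook may use a different threshold. The helper name `iqr_outlier_mask` and the sample scores are made up for illustration.

```python
import pandas as pd


def iqr_outlier_mask(values: pd.Series, k: float = 1.5) -> pd.Series:
    """Flag values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    return (values < q1 - k * iqr) | (values > q3 + k * iqr)


# Made-up scores; 101 lies above the 0-100 range and above the upper fence.
scores = pd.Series([62, 65, 66, 67, 68, 70, 71, 101], name="Exam_Score")
mask = iqr_outlier_mask(scores)
print(scores[mask].tolist())  # [101]
```

Whether flagged rows are dropped or kept is a separate decision; the mask only marks candidates for review.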

Key Findings

Best Model: Decision Tree (R² ≈ 0.98, RMSE ≈ 1.5 points)
Top Predictors: Previous_Scores, Hours_Studied, Attendance, Motivation_Level, Parental_Involvement
Controllable factors (study hours, attendance) have a stronger impact than uncontrollable factors (family income, distance from home)
High motivation is associated with exam scores roughly 10-15 points higher on average
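
A group-wise comparison such as the motivation finding can be computed with a pandas groupby. This is a minimal sketch: the six rows are made up purely to show the mechanics and do not reproduce the project's reported figures.

```python
import pandas as pd

# Made-up rows for illustration only; not drawn from the dataset.
df = pd.DataFrame(
    {
        "Motivation_Level": ["Low", "Low", "Medium", "Medium", "High", "High"],
        "Exam_Score": [60, 64, 66, 70, 72, 76],
    }
)

# Mean exam score per motivation level, then the High-vs-Low gap.
mean_by_level = df.groupby("Motivation_Level")["Exam_Score"].mean()
gap = mean_by_level["High"] - mean_by_level["Low"]
print(gap)  # 12.0
```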

Setup Instructions

Prerequisites

Python 3.8 or higher
Jupyter Notebook or JupyterLab

Installation

Clone this repository:

```
git clone <repository-url>
cd Data-Science-Student-Performance
```

Install required dependencies:
```
pip install -r requirements.txt
```
Launch Jupyter Notebook:
```
jupyter notebook
```
Open notebooks/Student_performance_factors.ipynb and run all cells

Usage

Explore the Data: Run cells in Student_performance_factors.ipynb sequentially
View Visualizations: All plots are generated inline in the notebook
Model Training: Models are trained automatically when running the notebook
Make Predictions: Use the trained models to predict exam scores for new students
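
The prediction step might look like the sketch below, assuming the saved model is a pickled scikit-learn regressor. The feature columns expected by models/best_model.pkl are not documented here, so the sketch trains a stand-in tree on made-up data and round-trips it through pickle in memory. The `flag_at_risk` helper and the threshold of 60 are hypothetical.

```python
import pickle

import numpy as np
from sklearn.tree import DecisionTreeRegressor


def flag_at_risk(model, features: np.ndarray, threshold: float = 60.0) -> np.ndarray:
    """Return True where the predicted score falls below the threshold."""
    return model.predict(features) < threshold


# Stand-in training data: columns are [Hours_Studied, Attendance] (made up).
X_train = np.array([[5, 60], [10, 70], [20, 85], [30, 95]])
y_train = np.array([55.0, 62.0, 70.0, 78.0])
model = DecisionTreeRegressor(random_state=0).fit(X_train, y_train)

# Round-trip through pickle, as loading a saved .pkl file would do.
restored = pickle.loads(pickle.dumps(model))

new_students = np.array([[5, 60], [30, 95]])
print(flag_at_risk(restored, new_students))  # [ True False]
```

New rows must pass through the same cleaning and feature engineering as the training data before being given to the model.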

Results

Model Performance:
- Linear Regression: R² ≈ 0.80
- Decision Tree: R² ≈ 0.98 (Best)
- Neural Network: R² ≈ 0.85
Feature Importance: Previous_Scores > Hours_Studied > Attendance > Motivation_Level
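
For reference, the three evaluation metrics can be computed directly from predictions. The sketch below uses NumPy only; the function name `regression_metrics` and the four sample values are illustrative and not project results.

```python
import numpy as np


def regression_metrics(y_true, y_pred) -> dict:
    """Compute R², RMSE, and MAE from true and predicted values."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    residuals = y_true - y_pred
    ss_res = np.sum(residuals ** 2)                  # unexplained variation
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total variation
    return {
        "r2": float(1.0 - ss_res / ss_tot),
        "rmse": float(np.sqrt(np.mean(residuals ** 2))),
        "mae": float(np.mean(np.abs(residuals))),
    }


metrics = regression_metrics([60, 70, 80, 90], [62, 69, 79, 92])
print({name: round(value, 2) for name, value in metrics.items()})
# {'r2': 0.98, 'rmse': 1.58, 'mae': 1.5}
```

RMSE and MAE are in exam-score points, so they can be read directly as a typical prediction error.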

Technologies Used

Python 3.8+
Data Analysis: pandas, numpy
Visualization: matplotlib, seaborn
Statistical Tests: scipy.stats
Machine Learning: scikit-learn (LinearRegression, DecisionTreeRegressor, MLPRegressor)
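
A comparison of the three scikit-learn estimators could be set up roughly as follows. This is a sketch on synthetic data: the hyperparameters, the 80/20 split, and the random seed are assumptions and are not the notebook's actual settings.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPRegressor
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for the real features and exam scores.
rng = np.random.default_rng(42)
X = rng.uniform(0, 1, size=(200, 3))
y = 50 + 20 * X[:, 0] + 10 * X[:, 1] + rng.normal(0, 1, size=200)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

models = {
    "Linear Regression": LinearRegression(),
    "Decision Tree": DecisionTreeRegressor(max_depth=5, random_state=42),
    "Neural Network": MLPRegressor(
        hidden_layer_sizes=(32,), max_iter=2000, random_state=42
    ),
}

# Fit each model and score it on the held-out split.
test_r2 = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    test_r2[name] = r2_score(y_test, model.predict(X_test))
    print(f"{name}: test R² = {test_r2[name]:.3f}")
```

Scoring on a held-out split matters most for the decision tree, which can fit its training data almost perfectly.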

Project Status

✅ Complete - Ready for submission

Author

WIA1007 Data Science Course Project

License

This project is for educational purposes only.

Tanvir-h-simon/Student-Performance-Analysis