This project investigates which factors influence the likelihood of a fatality in motor vehicle crashes in New York City (NYC). The analysis uses the NYC Motor Vehicle Collisions dataset, a collection of police crash reports from 2012 to 2025 covering crashes that involved injury, death, or at least $1,000 in damage.
After preprocessing and feature engineering, a binary target FATAL is defined (1 if a crash involved at least one death, else 0). The analysis includes:
- Pearson correlation to rank factor associations with fatalities
- Relative risk analysis to measure how much each top factor increases fatality likelihood
- Temporal aggregation to examine trends over time and seasonality
- Logistic regression classification to test whether top factors meaningfully predict fatal crashes

The analysis addresses the following questions:
- Which factors are most strongly associated with fatal outcomes in crashes in NYC?
- How much does each top factor increase the likelihood of a fatal crash compared with crashes where it is absent?
- Do the highest risk factors show meaningful temporal or seasonal patterns?
- Do the same factors remain important when used to predict fatalities in a supervised model?
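
As a rough illustration of the correlation and relative risk steps listed above, here is a minimal sketch in plain Python. It is not the project's implementation: the helper names `pearson` and `relative_risk` are hypothetical, the toy numbers are made up, and `deaths` simply stands in for a per-crash death count from which the binary FATAL target is derived.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation between two equal-length numeric sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return float("nan")  # undefined when either variable is constant
    return cov / sqrt(var_x * var_y)


def relative_risk(factor, fatal):
    """P(fatal | factor present) divided by P(fatal | factor absent)."""
    exposed = [f for x, f in zip(factor, fatal) if x == 1]
    unexposed = [f for x, f in zip(factor, fatal) if x == 0]
    if not exposed or not unexposed or sum(unexposed) == 0:
        return float("nan")  # ratio is undefined without a baseline risk
    risk_exposed = sum(exposed) / len(exposed)
    risk_unexposed = sum(unexposed) / len(unexposed)
    return risk_exposed / risk_unexposed


# Toy data, invented for illustration only (not taken from the dataset).
deaths = [1, 0, 2, 0, 0, 0, 0, 1, 0, 0, 0, 0]   # deaths per crash
factor = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0]   # hypothetical factor flag

# FATAL is 1 if a crash involved at least one death, else 0.
fatal = [1 if d >= 1 else 0 for d in deaths]

r = pearson(factor, fatal)
rr = relative_risk(factor, fatal)
print(f"Pearson r = {r:.3f}")
print(f"Relative risk = {rr:.1f}")  # 0.5 / 0.125 -> 4.0
```

With two binary variables, the Pearson coefficient is the same quantity as the phi coefficient, so it ranks the strength of an association but does not say how much more likely a fatality is. The relative risk supplies that: a value of 4.0 here means crashes with the factor are four times as likely to be fatal as crashes without it in this toy sample.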
Influencing Factors in Motor Vehicle Fatalities in NYC
git clone https://github.com/swish0621/MVFatalities.git
cd MVFatalities
# create and activate virtual environment
python -m venv venv
source venv/bin/activate # macOS / Linux
# venv\Scripts\activate # Windows
# install dependencies
pip install -r requirements.txt

This project uses the NYC Motor Vehicle Collisions dataset.
Download it manually from: https://data.cityofnewyork.us/Public-Safety/Motor-Vehicle-Collisions-Crashes/h9gi-nx95
Place the CSV in the root directory before running and ensure it is named:
"Motor_Vehicle_Collisions_-_Crashes.csv"
python -m project

For faster processing, uncomment the following line in load.py to use only the first 10,000 rows (this may affect results):
# df = df.head(10000)
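
The setup steps above depend on the CSV sitting in the repository root under the exact file name given. A minimal sketch of a check to run before `python -m project` might look like the following. The helper `find_dataset` is hypothetical and not part of the repository, and the sketch assumes it is run from the repository root.

```python
from pathlib import Path

# File name required by the setup instructions.
EXPECTED_NAME = "Motor_Vehicle_Collisions_-_Crashes.csv"


def find_dataset(root="."):
    """Return the dataset path if the CSV is in `root`, else raise FileNotFoundError."""
    path = Path(root) / EXPECTED_NAME
    if not path.is_file():
        raise FileNotFoundError(
            f"Expected {EXPECTED_NAME!r} in {Path(root).resolve()}; "
            "download it and place it there before running."
        )
    return path


if __name__ == "__main__":
    try:
        print(f"Found dataset: {find_dataset()}")
    except FileNotFoundError as err:
        print(err)
```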