The aim of this project is to develop a software product that predicts the likelihood of a collision being fatal, based on actual data collected by the Toronto police over a period of five years. The predictive service uses features such as weather conditions, road conditions, and location to classify incidents as either fatal or non-fatal.
- Drop duplicate rows based on the ACCNUM column
- Replace NaN values in the dataframe with ' '
- Retrieve the rows that do not contain the value 'unknown'
- Derive a new column INTERVAL from the TIME column, indicating the time period in which the accident occurred
- Visualize the object columns
a. Multiple: the number of unique values is > 2 and < 20
b. Binary: the number of unique values = 2
- Select the relevant columns in the dataframe:
a. Not too many unique values (< 20)
b. Sufficient non-missing values (missing values / all values < 0.5)
c. Related to the conditions of the accident and its severity (Fatal, Non-Fatal)
- Replace the values in ACCLASS as {'Non-Fatal': 0, 'Fatal': 1}
- Split the dataframe into X (features) and y (target)
- Implement the data transformations:
a. SimpleImputer(strategy="constant", fill_value='missing')
b. MinMaxScaler()
c. get_dummies()
- Feature selection:
a. SelectKBest(score_func=chi2, k=10)
- Handle the class imbalance by oversampling the minority class
- Create a Pipeline to streamline the transformers:
a. cat_pipeline: SimpleImputer, OneHotEncoder
b. num_pipeline: SimpleImputer, MinMaxScaler
- Transform the data according to the columns' dtypes
- Split the data into a training set and a testing set with proportions of 0.8 and 0.2
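The cleaning and target-encoding steps above can be sketched as follows. The toy dataframe is a hypothetical stand-in for the Toronto collision data; only the column names ACCNUM, TIME, and ACCLASS come from the project, and the INTERVAL bucketing shown is one plausible interpretation of "time period".

```python
import numpy as np
import pandas as pd

# Hypothetical rows standing in for the Toronto collision dataset;
# the real data has many more columns (weather, road, location, ...).
df = pd.DataFrame({
    "ACCNUM":  [1, 1, 2, 3, 4],
    "TIME":    [130, 130, 945, 1730, 2215],          # HHMM format
    "ROAD":    ["Wet", "Wet", np.nan, "Dry", "unknown"],
    "ACCLASS": ["Fatal", "Fatal", "Non-Fatal", "Non-Fatal", "Fatal"],
})

# Drop duplicate accident numbers, keeping the first record
df = df.drop_duplicates(subset="ACCNUM")

# Replace NaN with a blank placeholder, then drop rows holding 'unknown'
df = df.fillna(" ")
df = df[~df.eq("unknown").any(axis=1)]

# Convert HHMM-style TIME into a coarse INTERVAL bucket
df["INTERVAL"] = pd.cut(df["TIME"] // 100,
                        bins=[0, 6, 12, 18, 24],
                        labels=["Night", "Morning", "Afternoon", "Evening"],
                        right=False)

# Encode the target and split into X (features) and y (target)
df["ACCLASS"] = df["ACCLASS"].map({"Non-Fatal": 0, "Fatal": 1})
X = df.drop(columns="ACCLASS")
y = df["ACCLASS"]
print(len(df), y.tolist())  # 3 rows survive; target is now 0/1
```

An 80/20 train/test split then follows via `train_test_split(X, y, test_size=0.2)`.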
Feature selection is an important step in machine learning, as it helps to identify the most relevant features for a model. The goal of feature selection is to remove irrelevant or redundant features, which can lead to overfitting and decrease the accuracy of a model.
In this project, four tools and techniques were used for feature selection and model building:
- SelectKBest: a statistical method that selects the K best features based on a given score function. In this project, the chi-squared test was used as the score function to select the best features.
- RandomizedSearchCV: a technique used for hyperparameter tuning. It randomly selects combinations of hyperparameters and evaluates their performance using cross-validation.
- StratifiedShuffleSplit: a method for splitting a dataset into training and test sets while preserving the class distribution.
- Pipeline: a method for chaining multiple steps of a machine learning workflow, such as data preprocessing, feature selection, and model training.
The first model (Logistic Regression) with solver 'saga', penalty 'l1', and C=10 achieved an accuracy of 0.7568, precision of 0.7631, recall of 0.9774, and F1-score of 0.8571. The confusion matrix shows that 95 instances were correctly classified as negative, and 2504 instances were correctly classified as positive.
The second model (Decision Tree Classifier) with min_samples_split=10, min_samples_leaf=1, max_depth=28, and criterion='gini' achieved an accuracy of 0.7702, precision of 0.7822, recall of 0.9590, and F1-score of 0.8617. The confusion matrix shows that 188 instances were correctly classified as negative, and 2457 instances were correctly classified as positive.
The third model (Random Forest Classifier) with n_estimators=300, min_samples_split=2, min_samples_leaf=2, max_depth=None, and criterion='gini' achieved an accuracy of 0.7708, precision of 0.7815, recall of 0.9617, and F1-score of 0.8623. The confusion matrix shows that 183 instances were correctly classified as negative, and 2464 instances were correctly classified as positive.
The fourth model (MLP Classifier) with solver 'lbfgs', learning_rate='constant', hidden_layer_sizes=(20, 10), alpha=0.01, and activation='tanh' achieved an accuracy of 0.7723, precision of 0.7827, recall of 0.9617, and F1-score of 0.8630. The confusion matrix shows that 188 instances were correctly classified as negative, and 2464 instances were correctly classified as positive.
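The four tuned configurations reported above can be instantiated as shown below. The hyperparameters are the ones stated in the results; the dataset is a synthetic stand-in, so the scores will not reproduce the reported figures.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import MinMaxScaler
from sklearn.tree import DecisionTreeClassifier

# Tuned hyperparameters as reported in the results above
models = {
    "LogReg": LogisticRegression(solver="saga", penalty="l1", C=10,
                                 max_iter=5000),
    "Tree":   DecisionTreeClassifier(min_samples_split=10, min_samples_leaf=1,
                                     max_depth=28, criterion="gini",
                                     random_state=42),
    "Forest": RandomForestClassifier(n_estimators=300, min_samples_split=2,
                                     min_samples_leaf=2, max_depth=None,
                                     criterion="gini", random_state=42),
    "MLP":    MLPClassifier(solver="lbfgs", learning_rate="constant",
                            hidden_layer_sizes=(20, 10), alpha=0.01,
                            activation="tanh", max_iter=2000, random_state=42),
}

# Synthetic stand-in data, scaled to [0, 1] so saga/l1 converges cleanly
X, y = make_classification(n_samples=300, n_features=10, random_state=42)
X = MinMaxScaler().fit_transform(X)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = accuracy_score(y_test, model.predict(X_test))
print(scores)
```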