https://github.com/Yasith-Banuka/ML-Project
Upon analyzing the dataset, the following characteristics of the features were found.
- amount_tsh - contains a large number of zeroes; these may be erroneous, actual data, or a combination.
- wpt_name - extremely high cardinality.
- num_private - 98% zeroes.
- recorded_by - same value throughout.
- funder | installer - entries are common to most records. Many erroneous entries result in high cardinality. Many missing values.
- basin | subvillage | region | region_code | district_code | longitude | latitude - location-based features. Some are of high cardinality.
  - longitude - has erroneous zeroes.
- gps_height - contains negative entries (which is physically impossible, i.e. erroneous data). Also contains many zeroes.
The following 7 sets of features refer to the same detail and are identical in most rows; the cardinality differs slightly between the features in each set:
- scheme_management | management | management_group
- extraction_type | extraction_type_group | extraction_type_class
- payment | payment_type
- water_quality | quality_group
- quantity | quantity_group
- source | source_type | source_class
- waterpoint_type | waterpoint_type_group
Correlation
As expected, all the location features are correlated with one another, as are the features within each of the 7 similar sets.
The following features were selected based on the initial analysis:
- amount_tsh
- funder
- longitude
- latitude
- basin
- subvillage
- region_code
- district_code
- lga
- ward
- population
- public_meeting
- permit
- extraction_type_group
- payment
- quality_group
- quantity
- source_class
- waterpoint_type_group
- management
- scheme_management
- construction_year
The following features were removed:
- date_recorded - no useful information
- installer - similar to funder
- wpt_name - high cardinality and missing values
- num_private - no information
- region - similar to region_code
- recorded_by - no information
- scheme_name - missing values
One feature from each of the 7 groups mentioned in the earlier section was picked, considering cardinality and target relevance.
Funder
As funder & installer are similar, funder was chosen and installer was dropped. Extensive cleaning was performed:
- Lowercase all entries to eliminate case-based duplicates.
- Fill any missing data with values available from installer.
- Fill all remaining missing and invalid data with a 'nan' category.
- Look for common words in the column and combine all entries containing those words into a single category.
  - E.g. combine all entries containing 'unicef' under a single 'unicef' category.
- Combine all similar funders into one category.
  - E.g. all governmental funders into one category.
- Replace all categories with fewer than 100 entries with an 'other' category.
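The cleaning steps above can be sketched in pandas. The keyword lists, invalid-value markers, and helper name below are illustrative stand-ins, not the exact ones used in the project:

```python
import pandas as pd

def clean_funder(df: pd.DataFrame, min_count: int = 100) -> pd.Series:
    """Illustrative cleaning of the 'funder' column following the steps above."""
    funder = df["funder"].str.lower().str.strip()
    # Backfill missing funders from installer, then mark the rest as 'nan'
    funder = funder.fillna(df["installer"].str.lower())
    funder = funder.replace({"0": None, "-": None}).fillna("nan")
    # Collapse entries sharing a common keyword into one category
    for keyword in ["unicef", "danida", "world bank"]:
        funder[funder.str.contains(keyword, na=False)] = keyword
    # Group similar funders, e.g. governmental ones (illustrative mapping)
    gov_terms = ["government", "gov", "tanzania"]
    funder[funder.str.contains("|".join(gov_terms), na=False)] = "government"
    # Bucket rare categories into 'other'
    counts = funder.value_counts()
    rare = counts[counts < min_count].index
    return funder.where(~funder.isin(rare), "other")
```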
Management
Interesting nuances between scheme_management & management were observed, thus a cross feature called 'management_cross' was created. Most of the entries remained as is, but some new categories were created. Any category with fewer than 100 entries was put into an 'other' category. Missing values were filled with a 'nan' category.
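One way such a cross feature might be built is sketched below, assuming a simple "keep the value when the columns agree, concatenate when they differ" rule; the project's exact rules may differ:

```python
import pandas as pd

def build_management_cross(df: pd.DataFrame, min_count: int = 100) -> pd.Series:
    # Illustrative cross of scheme_management and management: keep the shared
    # value when the two columns agree, otherwise form a combined category
    a = df["scheme_management"].fillna("nan").str.lower()
    b = df["management"].fillna("nan").str.lower()
    cross = a.where(a == b, a + "_" + b)
    # Bucket rare categories into 'other'
    counts = cross.value_counts()
    rare = counts[counts < min_count].index
    return cross.where(~cross.isin(rare), "other")
```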
gps_height
- Get absolute values to eliminate negative readings
Construction year
- Removed all erroneous entries containing 0
- Created a new feature called 'age'
  - The age of the waterpoint as of 2013 is recorded in this feature (age = 2013 - construction_year)
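A minimal sketch of the age feature, using 2013 as the reference year stated above:

```python
import pandas as pd

REFERENCE_YEAR = 2013

def add_age(df: pd.DataFrame) -> pd.DataFrame:
    df = df.copy()
    # Treat the erroneous 0 entries as missing before computing the age
    year = df["construction_year"].replace(0, float("nan"))
    df["age"] = REFERENCE_YEAR - year
    return df
```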
Permit | Public meeting
- Encoded with 0 and 1 for False and True respectively.
Missing values
| Data type | Columns | Method |
|---|---|---|
| Numerical | age, longitude, gps_height | mean |
| Boolean | public_meeting, permit | mode |
| Categorical | management_cross, subvillage | 'nan' category |
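The imputation table above translates into a few lines of pandas; this is a sketch, with the column lists taken directly from the table:

```python
import pandas as pd

NUMERIC = ["age", "longitude", "gps_height"]
BOOLEAN = ["public_meeting", "permit"]
CATEGORICAL = ["management_cross", "subvillage"]

def impute(df: pd.DataFrame) -> pd.DataFrame:
    df = df.copy()
    for col in NUMERIC:
        df[col] = df[col].fillna(df[col].mean())   # mean imputation
    for col in BOOLEAN:
        df[col] = df[col].fillna(df[col].mode()[0])  # most frequent value
    for col in CATEGORICAL:
        df[col] = df[col].fillna("nan")            # explicit 'nan' category
    return df
```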
Multiple algorithms were tested for the problem, and CatBoost performed best among them. Given the high number of categorical features in the dataset, CatBoost is a good fit, as it has extensive native support for categorical features. It also simplifies several other parts of the pipeline.
Encoding
CatBoost has its own target encoder, which is well suited to classification. Since all the categorical features are nominal and of high cardinality, this is a good choice.
Regularization
CatBoost applies L2 regularization to its models (the l2_leaf_reg parameter), which is also optimized during hyperparameter tuning.
Hyperparameter tuning
Done using Hyperopt. The most important parameters were chosen for optimization:
- learning rate (learning_rate)
- depth of tree (depth)
- subsampling rate for bagging (subsample)
- model size regularization (model_size_reg)
- feature combinations (max_ctr_complexity)
Cross-validation
A 6-fold cross-validation is done using CatBoost's built-in cross-validation functionality.
Feature Importance
| Feature | Importance |
|---|---|
| quantity | 35.494263 |
| waterpoint_type_group | 22.619132 |
| ward | 19.299193 |
| lga | 15.545347 |
| extraction_type_group | 3.930602 |
| payment | 2.698304 |
| age | 0.207756 |
| funder | 0.205402 |
| source_class | 0.000000 |
| quality_group | 0.000000 |
| amount_tsh | 0.000000 |
| public_meeting | 0.000000 |
| district_code | 0.000000 |
| basin | 0.000000 |
| latitude | 0.000000 |
| longitude | 0.000000 |
| gps_height | 0.000000 |
| population | 0.000000 |
SHAP Values
The following features were found to be less important based on the diagrams above, and were thus removed from training:
- population
- gps_height
- longitude
- latitude
- basin
- subvillage
- region_code
- permit
- public_meeting
- district_code
The following resources were used in building this solution:
https://towardsdatascience.com/pump-it-up-with-catboost-828bf9eaac68
