
ML Project - "Pump-It-up"

Exploratory Analysis

An exploratory analysis of the dataset revealed the following characteristics of the features.

  • amount_tsh - Contains a large number of zeroes; these may be erroneous, genuine values, or a mix of both.

  • wpt_name - Extremely high cardinality.

  • num_private - 98% zeroes.

  • recorded_by - Same value throughout.

  • funder | installer - The two columns agree in most records. Many erroneous entries result in high cardinality, and there are many missing values.

  • basin | subvillage | region | region_code | district_code | longitude | latitude - Location-based features; some have high cardinality.

    • longitude - Has erroneous zeroes.
  • gps_height - Contains negative entries (which is impossible, i.e. erroneous data) and many zeroes.

  • The following 7 sets of features each describe the same detail and are identical in most rows; the cardinality differs slightly within each set:

    • scheme_management | management | management_group
    • extraction_type | extraction_type_group | extraction_type_class
    • payment | payment_type
    • water_quality | quality_group
    • quantity | quantity_group
    • source | source_type | source_class
    • waterpoint_type | waterpoint_type_group
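One quick way to confirm that a pair of columns like payment | payment_type carries the same information is to compare cardinalities and check whether the value pairs form a one-to-one mapping. A minimal pure-Python sketch (column names match the dataset, the sample values are made up for illustration):

```python
# Hypothetical mini-sample: column names are from the dataset, the
# values are invented purely for illustration.
rows = [
    {"payment": "pay monthly", "payment_type": "monthly"},
    {"payment": "never pay", "payment_type": "never pay"},
    {"payment": "pay annually", "payment_type": "annually"},
    {"payment": "pay monthly", "payment_type": "monthly"},
]

def cardinality(rows, column):
    """Number of distinct values observed in a column."""
    return len({row[column] for row in rows})

def is_one_to_one(rows, col_a, col_b):
    """True if the two columns are just relabelings of each other."""
    pairs = {(row[col_a], row[col_b]) for row in rows}
    return len(pairs) == cardinality(rows, col_a) == cardinality(rows, col_b)

print(cardinality(rows, "payment"), cardinality(rows, "payment_type"))  # 3 3
print(is_one_to_one(rows, "payment", "payment_type"))                   # True
```

If the mapping is one-to-one on the full data, keeping both columns adds no information, which motivates picking one feature per group below.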

Correlation

(Figure: correlation matrix of the features)

As expected, all the location features are correlated. So are the 7 sets of similar features.

Feature Selection

The following features were selected based on the initial analysis:

  • amount_tsh
  • funder
  • longitude
  • latitude
  • basin
  • subvillage
  • region_code
  • district_code
  • lga
  • ward
  • population
  • public_meeting
  • permit
  • extraction_type_group
  • payment
  • quality_group
  • quantity
  • source_class
  • waterpoint_type_group
  • management
  • scheme_management
  • construction_year

The following features were removed:

  • date_recorded - no useful information
  • installer - similar to funder
  • wpt_name - high cardinality and missing values
  • num_private - no information
  • region - similar to region_code
  • recorded_by - no information
  • scheme_name - missing values

One feature from each of the 7 groups mentioned in the earlier section was picked, considering cardinality and target relevance.



Feature Engineering

Funder

As funder & installer are similar, funder was chosen and installer was dropped. Extensive cleaning was then performed:

  • Lowercase all entries to eliminate case-based duplicates.
  • Fill missing values with data available from installer.
  • Fill all remaining missing and invalid data with a 'nan' category.
  • Look for common words in the column and merge all entries containing those words into a single category.
    • E.g. combine all entries containing 'unicef' under a single 'unicef' category.
  • Combine all similar funders into one category.
    • E.g. all governmental funders into one category.
  • Replace all categories with fewer than 100 entries with an 'other' category.
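The steps above can be sketched as a single pass over the column. The keyword table and the small `min_count` in the usage example are illustrative; the source only specifies the 100-entry threshold:

```python
from collections import Counter

def clean_funder(values, keyword_groups, min_count=100):
    """Sketch of the funder cleaning steps described above.

    keyword_groups maps a substring to the merged category it should
    collapse into, e.g. {"unicef": "unicef"}.
    """
    cleaned = []
    for v in values:
        v = (v or "").strip().lower()       # lowercase; treat None/'' as missing
        if not v or v == "0":
            v = "nan"                       # missing/invalid -> 'nan' category
        for keyword, category in keyword_groups.items():
            if keyword in v:                # merge entries sharing a keyword
                v = category
                break
        cleaned.append(v)
    counts = Counter(cleaned)
    # rare categories (< min_count entries) -> 'other'
    return [v if counts[v] >= min_count else "other" for v in cleaned]

funders = ["UNICEF", "Unicef Tanzania", "World Bank", None]
print(clean_funder(funders, {"unicef": "unicef"}, min_count=2))
# ['unicef', 'unicef', 'other', 'other']
```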

Management

Interesting nuances between scheme_management & management were observed, so a cross feature called 'management_cross' was created. Most entries remained as-is, but some new categories were created. Any category with fewer than 100 entries was put into an 'other' category, and missing values were filled with a 'nan' category.
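The exact merging rules for 'management_cross' are not spelled out; the general shape of such a cross feature could look like this (the `|` separator and the small `min_count` in the test are assumptions, the 100-entry default follows the text):

```python
from collections import Counter

def cross_feature(col_a, col_b, min_count=100):
    """Combine two categorical columns into one cross feature.

    Missing values become the 'nan' category, agreeing values stay
    as a single category, and rare crosses collapse into 'other'.
    """
    crossed = []
    for a, b in zip(col_a, col_b):
        a = a if a else "nan"
        b = b if b else "nan"
        crossed.append(a if a == b else f"{a}|{b}")
    counts = Counter(crossed)
    return [c if counts[c] >= min_count else "other" for c in crossed]

print(cross_feature(["vwc", "wug", None], ["vwc", "company", "vwc"], min_count=1))
# ['vwc', 'wug|company', 'nan|vwc']
```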

gps_height

  • Take absolute values to eliminate negative readings.

Construction year

  • Removed all erroneous entries containing 0.
  • Created a new feature called 'age'.
  • This feature records how old the waterpoint was in 2013 (age = 2013 - construction_year).
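As a small sketch of the age derivation (treating the erroneous 0 years as missing, to be imputed later per the table below):

```python
def to_age(construction_year, reference_year=2013):
    """age = reference_year - construction_year; 0 marks a missing year."""
    if construction_year == 0:
        return None          # erroneous entry, left for mean imputation
    return reference_year - construction_year

print(to_age(1999))  # 14
print(to_age(0))     # None
```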

Permit | Public meeting

  • Encoded with 0 and 1 for False and True respectively.

Handling missing values

Data type   | Columns                      | Method
------------|------------------------------|----------------
Numerical   | age, longitude, gps_height   | mean
Boolean     | public_meeting, permit       | mode
Categorical | management_cross, subvillage | 'nan' category
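The table above maps to a simple per-type imputation rule; a minimal sketch using the standard library:

```python
from statistics import mean, mode

def impute(values, kind):
    """Fill missing (None) entries per the table above:
    numerical -> mean, boolean -> mode, categorical -> 'nan' category."""
    present = [v for v in values if v is not None]
    if kind == "numerical":
        fill = mean(present)
    elif kind == "boolean":
        fill = mode(present)
    else:  # categorical
        fill = "nan"
    return [fill if v is None else v for v in values]

print(impute([10.0, None, 20.0], "numerical"))  # [10.0, 15.0, 20.0]
print(impute([1, 1, 0, None], "boolean"))       # [1, 1, 0, 1]
```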



CatBoost

Multiple algorithms were tested on the problem, and CatBoost performed best among them. Given the high number of categorical features in the dataset, CatBoost is a natural fit, since it has extensive built-in support for categorical features; it also simplifies several other steps in the pipeline.

Encoding

CatBoost has its own target-based encoder, which suits classification well. Since all the categorical features are nominal and of high cardinality, this is the preferred encoding method.
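CatBoost's ordered target statistics replace each category with a smoothed running mean of the target computed only from the rows that precede it, which avoids target leakage. A simplified conceptual sketch of that idea (not CatBoost's exact implementation; the `prior` value and smoothing form here are assumptions):

```python
def ordered_target_encode(categories, targets, prior=0.5):
    """Encode each row's category using target statistics from earlier
    rows only (CatBoost-style ordered encoding, greatly simplified)."""
    sums, counts, encoded = {}, {}, []
    for cat, y in zip(categories, targets):
        s, n = sums.get(cat, 0.0), counts.get(cat, 0)
        encoded.append((s + prior) / (n + 1))   # smoothed running mean
        sums[cat] = s + y                       # update AFTER encoding
        counts[cat] = n + 1
    return encoded

print(ordered_target_encode(["a", "a", "b", "a"], [1, 0, 1, 1]))
# [0.5, 0.75, 0.5, 0.5]
```

Note how the row's own target never influences its own encoding, which is the property that makes this safe for high-cardinality nominal features.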

Regularization

CatBoost applies L2 regularization to its models; the regularization strength is also optimized during hyperparameter tuning.

Hyperparameter tuning

Tuning was done using Hyperopt. The most important parameters were chosen for optimization:

  • learning rate
  • depth of tree
  • subsampling rate for bagging
  • model size regularization
  • feature combination
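The parameters above correspond to CatBoost's learning_rate, depth, subsample, model_size_reg, and max_ctr_complexity. The exact Hyperopt search space used is not given; as a stand-in, the shape of such a search can be sketched with plain random search (ranges are illustrative assumptions; Hyperopt's `fmin` with TPE would replace the loop below):

```python
import random

# Illustrative search space for the parameters listed above.
SPACE = {
    "learning_rate": lambda: 10 ** random.uniform(-3, -0.5),
    "depth": lambda: random.randint(4, 10),
    "subsample": lambda: random.uniform(0.5, 1.0),
    "model_size_reg": lambda: random.uniform(0.0, 1.0),
    "max_ctr_complexity": lambda: random.randint(1, 4),
}

def random_search(objective, space, n_trials=20, seed=0):
    """Minimal random-search stand-in for Hyperopt's fmin (minimizes)."""
    random.seed(seed)
    best_loss, best_params = float("inf"), None
    for _ in range(n_trials):
        params = {name: sample() for name, sample in space.items()}
        loss = objective(params)
        if loss < best_loss:
            best_loss, best_params = loss, params
    return best_loss, best_params

# Toy objective standing in for cross-validated model loss.
loss, params = random_search(lambda p: (p["depth"] - 6) ** 2, SPACE)
```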

Cross-validation

6-fold cross-validation was done using CatBoost's built-in cross-validation functionality.
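CatBoost's `cv()` handles the fold splitting internally; for reference, a plain (non-stratified) 6-fold split of row indices can be sketched as:

```python
def kfold_indices(n_rows, k=6):
    """Split row indices into k contiguous (train, valid) folds."""
    indices = list(range(n_rows))
    fold_size, folds = n_rows // k, []
    for i in range(k):
        start = i * fold_size
        stop = n_rows if i == k - 1 else start + fold_size
        valid = indices[start:stop]
        train = indices[:start] + indices[stop:]
        folds.append((train, valid))
    return folds

folds = kfold_indices(12, k=6)
print(len(folds))      # 6
print(folds[0][1])     # [0, 1]
```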


Post-Processing



Feature Importance

Feature               | Importance
----------------------|-----------
quantity              | 35.494263
waterpoint_type_group | 22.619132
ward                  | 19.299193
lga                   | 15.545347
extraction_type_group | 3.930602
payment               | 2.698304
age                   | 0.207756
funder                | 0.205402
source_class          | 0.000000
quality_group         | 0.000000
amount_tsh            | 0.000000
public_meeting        | 0.000000
district_code         | 0.000000
basin                 | 0.000000
latitude              | 0.000000
longitude             | 0.000000
gps_height            | 0.000000
population            | 0.000000


SHAP Values

(Figure: SHAP summary plot)



The following features were found to be less important based on the diagrams above, and were thus removed from training:

  • population
  • gps_height
  • longitude
  • latitude
  • basin
  • subvillage
  • region_code
  • permit
  • public_meeting
  • district_code



Competition submission



(Figure: competition submission score)



References

The following resources were used in building this solution:

https://towardsdatascience.com/pump-it-up-with-catboost-828bf9eaac68

https://github.com/catboost/tutorials

About

A machine learning solution, written in Python, to the DrivenData competition "Pump it Up: Data Mining the Water Table".
