https://github.com/Yasith-Banuka/ML-Project
Upon analyzing the dataset, the following characteristics of the features were found.
- amount_tsh - contains a large number of zeroes; these may be erroneous, actual data, or a combination.
- wpt_name - extremely high cardinality.
- num_private - 98% zeroes.
- recorded_by - same value throughout.
- funder | installer - entries are common to most records. Many erroneous entries result in high cardinality. Many missing values.
- basin | subvillage | region | region_code | district_code | longitude | latitude - location-based features. Some are of high cardinality.
  - longitude - has erroneous zeroes.
- gps_height - contains negative entries (which is physically impossible, i.e. erroneous data). Also contains many zeroes.
The following 7 sets of features refer to the same detail and are identical in most rows; the cardinality differs slightly between the features in each set:
- scheme_management | management | management_group
- extraction_type | extraction_type_group | extraction_type_class
- payment | payment_type
- water_quality | quality_group
- quantity | quantity_group
- source | source_type | source_class
- waterpoint_type | waterpoint_type_group
Correlation
As expected, all the location features are correlated with one another, as are the features within each of the 7 similar sets.
The following features were selected based on the initial analysis:
- amount_tsh
- funder
- longitude
- latitude
- basin
- subvillage
- region_code
- district_code
- lga
- ward
- population
- public_meeting
- permit
- extraction_type_group
- payment
- quality_group
- quantity
- source_class
- waterpoint_type_group
- management
- scheme_management
- construction_year
The following features were removed:
- date_recorded - no useful information
- installer - similar to funder
- wpt_name - high cardinality and missing values
- num_private - no information
- region - similar to region_code
- recorded_by - no information
- scheme_name - missing values
One feature from each of the 7 groups mentioned in the earlier section was picked, considering cardinality and target relevance.
Funder
As funder & installer are similar, funder was chosen and installer was dropped. Extensive cleaning was performed:
- Lowercase all entries to eliminate case-based duplicates.
- Fill any missing data with values available from installer.
- Fill all remaining missing and invalid data with a 'nan' category.
- Look for common words in the column and combine all entries containing those words into a single category.
  - E.g. combine all entries containing 'unicef' under a single 'unicef' category.
- Combine all similar funders into one category.
  - E.g. all governmental funders into one category.
- Replace all categories with fewer than 100 entries with an 'other' category.
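The cleaning steps above can be sketched in pandas. The keyword lists, invalid-value markers, and helper name below are illustrative stand-ins, not the exact ones used in the project:

```python
import pandas as pd

def clean_funder(df: pd.DataFrame, min_count: int = 100) -> pd.Series:
    """Illustrative cleaning of the 'funder' column following the steps above."""
    funder = df["funder"].str.lower().str.strip()
    # Backfill missing funders from installer, then mark the rest as 'nan'
    funder = funder.fillna(df["installer"].str.lower())
    funder = funder.replace({"0": None, "-": None}).fillna("nan")
    # Collapse entries sharing a common keyword into one category
    for keyword in ["unicef", "danida", "world bank"]:
        funder[funder.str.contains(keyword, na=False)] = keyword
    # Group similar funders, e.g. governmental ones (illustrative mapping)
    gov_terms = ["government", "gov", "tanzania"]
    funder[funder.str.contains("|".join(gov_terms), na=False)] = "government"
    # Bucket rare categories into 'other'
    counts = funder.value_counts()
    rare = counts[counts < min_count].index
    return funder.where(~funder.isin(rare), "other")
```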
Management
Interesting nuances between scheme_management & management were observed, thus a cross feature called 'management_cross' was created. Most of the entries remained as is, but some new categories were created. Any category with fewer than 100 entries was put into an 'other' category. Missing values were filled with a 'nan' category.
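One way such a cross feature might be built is sketched below, assuming a simple "keep the value when the columns agree, concatenate when they differ" rule; the project's exact rules may differ:

```python
import pandas as pd

def build_management_cross(df: pd.DataFrame, min_count: int = 100) -> pd.Series:
    # Illustrative cross of scheme_management and management: keep the shared
    # value when the two columns agree, otherwise form a combined category
    a = df["scheme_management"].fillna("nan").str.lower()
    b = df["management"].fillna("nan").str.lower()
    cross = a.where(a == b, a + "_" + b)
    # Bucket rare categories into 'other'
    counts = cross.value_counts()
    rare = counts[counts < min_count].index
    return cross.where(~cross.isin(rare), "other")
```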
gps_height
- Get absolute values to eliminate negative readings
Construction year
- Removed all erroneous entries containing 0
- Created a new feature called 'age'
  - The age of the waterpoint as of 2013 is recorded in this feature (age = 2013 - construction_year)
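A minimal sketch of the age feature, using 2013 as the reference year stated above:

```python
import pandas as pd

REFERENCE_YEAR = 2013

def add_age(df: pd.DataFrame) -> pd.DataFrame:
    df = df.copy()
    # Treat the erroneous 0 entries as missing before computing the age
    year = df["construction_year"].replace(0, float("nan"))
    df["age"] = REFERENCE_YEAR - year
    return df
```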
Permit | Public meeting
- Encoded with 0 and 1 for False and True respectively.
Missing values
| Data type | Columns | Method |
|---|---|---|
| Numerical | age, longitude, gps_height | mean |
| Boolean | public_meeting, permit | mode |
| Categorical | management_cross, subvillage | 'nan' category |
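The imputation table above translates into a few lines of pandas; this is a sketch, with the column lists taken directly from the table:

```python
import pandas as pd

NUMERIC = ["age", "longitude", "gps_height"]
BOOLEAN = ["public_meeting", "permit"]
CATEGORICAL = ["management_cross", "subvillage"]

def impute(df: pd.DataFrame) -> pd.DataFrame:
    df = df.copy()
    for col in NUMERIC:
        df[col] = df[col].fillna(df[col].mean())   # mean imputation
    for col in BOOLEAN:
        df[col] = df[col].fillna(df[col].mode()[0])  # most frequent value
    for col in CATEGORICAL:
        df[col] = df[col].fillna("nan")            # explicit 'nan' category
    return df
```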
Multiple algorithms were tested for the problem, and CatBoost performed best among them. Given the high number of categorical features in the dataset, CatBoost is a good fit, as it has extensive native support for categorical features. It also simplifies several other parts of the pipeline.
Encoding
CatBoost has its own target encoder, which is well suited to classification. Since all the categorical features are nominal and of high cardinality, this is a good choice.
Regularization
CatBoost applies L2 regularization to its models (the l2_leaf_reg parameter), which is also optimized during hyperparameter tuning.
Hyperparameter tuning
Done using Hyperopt. The most important parameters were chosen for optimization:
- learning rate (learning_rate)
- depth of tree (depth)
- subsampling rate for bagging (subsample)
- model size regularization (model_size_reg)
- feature combinations (max_ctr_complexity)
Cross-validation
A 6-fold cross-validation is done using CatBoost's built-in cross-validation functionality.
Feature Importance
| Feature | Importance |
|---|---|
| quantity | 35.494263 |
| waterpoint_type_group | 22.619132 |
| ward | 19.299193 |
| lga | 15.545347 |
| extraction_type_group | 3.930602 |
| payment | 2.698304 |
| age | 0.207756 |
| funder | 0.205402 |
| source_class | 0.000000 |
| quality_group | 0.000000 |
| amount_tsh | 0.000000 |
| public_meeting | 0.000000 |
| district_code | 0.000000 |
| basin | 0.000000 |
| latitude | 0.000000 |
| longitude | 0.000000 |
| gps_height | 0.000000 |
| population | 0.000000 |
SHAP Values
The following features were found to be less important based on the diagrams above, and were thus removed from training:
- population
- gps_height
- longitude
- latitude
- basin
- subvillage
- region_code
- permit
- public_meeting
- district_code
The following resources were used in building this solution:
https://towardsdatascience.com/pump-it-up-with-catboost-828bf9eaac68
