Data source: https://www.kaggle.com/mirichoi0218/insurance
• Sought insight from the dataset through exploratory data analysis (EDA)
• Processed and engineered the data to prepare it for modeling
• Built a model to predict insurance cost from the features
• The sex and region features are almost evenly balanced, while non-smokers and obese people make up the largest groups

• People who smoke and have a BMI above 30 tend to have higher medical costs

• Older people who smoke have higher charges

• People who both smoke and are obese have the highest average charges of any group
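
The group comparisons above can be reproduced with a pandas group-by. The following is a minimal sketch, assuming the Kaggle CSV has been loaded into a DataFrame with its `smoker`, `bmi`, and `charges` columns. The helper name `mean_charges_by_group`, the `obese` flag, and the cutoff of BMI 30 or more are illustrative assumptions, not the author's code.

```python
import pandas as pd


def mean_charges_by_group(df: pd.DataFrame) -> pd.Series:
    """Average charges for each (smoker, obese) combination, highest first."""
    # Assumed cutoff: a BMI of 30 or more counts as obese.
    flagged = df.assign(obese=df["bmi"] >= 30)
    return (
        flagged.groupby(["smoker", "obese"])["charges"]
        .mean()
        .sort_values(ascending=False)
    )
```

Calling `mean_charges_by_group(pd.read_csv("insurance.csv"))` (the file name is assumed) returns one average per group, so the smoker-and-obese group can be compared against the rest directly.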

• Check for missing values - there are none
• Check for duplicate rows - there is 1 duplicate, which is removed
• Feature engineering - create a new column, weight_status, based on the BMI score
• Feature transformation
Encoding sex, region, & weight_status
Ordinal encoding of smoker
• Modeling
Separating target & features
Splitting train & test data
Modeling using the Linear Regression, Random Forest, Decision Tree, Ridge, & Lasso algorithms
Finding the best algorithm
Tuning hyperparameters
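
The preparation and modeling steps above can be sketched with pandas and scikit-learn. This is an illustration under stated assumptions rather than the author's notebook: the function names `prepare` and `compare_models`, the WHO-style BMI bands, one-hot encoding for sex, region, and weight_status, the 0/1 mapping for smoker, the 80/20 split, and the default model settings are all assumed. It covers the steps up to model comparison and leaves out hyperparameter tuning.

```python
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Lasso, LinearRegression, Ridge
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor


def prepare(df: pd.DataFrame) -> pd.DataFrame:
    """Drop duplicate rows, add weight_status, and encode the categoricals."""
    df = df.drop_duplicates().copy()
    # Assumed WHO-style BMI bands; the author's exact bin edges are not given.
    df["weight_status"] = pd.cut(
        df["bmi"],
        bins=[0, 18.5, 25, 30, float("inf")],
        labels=["underweight", "normal", "overweight", "obese"],
        right=False,  # intervals are [low, high), so a BMI of exactly 30 is "obese"
    )
    # Assumed ordinal order for smoker: no < yes.
    df["smoker"] = df["smoker"].map({"no": 0, "yes": 1})
    # Assumed one-hot encoding for the remaining categorical columns.
    return pd.get_dummies(df, columns=["sex", "region", "weight_status"], dtype=int)


def compare_models(df: pd.DataFrame, seed: int = 42) -> pd.DataFrame:
    """Fit each algorithm with default settings and collect its scores."""
    data = prepare(df)
    X, y = data.drop(columns="charges"), data["charges"]
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=seed  # assumed 80/20 split
    )
    models = {
        "LinearRegression": LinearRegression(),
        "DecisionTree": DecisionTreeRegressor(random_state=seed),
        "RandomForest": RandomForestRegressor(random_state=seed),
        "Ridge": Ridge(),
        "Lasso": Lasso(),
    }
    rows = {}
    for name, model in models.items():
        model.fit(X_train, y_train)
        pred = model.predict(X_test)
        rows[name] = {
            "MAE": mean_absolute_error(y_test, pred),
            "RMSE": mean_squared_error(y_test, pred) ** 0.5,
            "R2": r2_score(y_test, pred),
            "Train score": model.score(X_train, y_train),  # R2 on training data
            "Test score": model.score(X_test, y_test),  # R2 on held-out data
        }
    # One column per model, one row per metric.
    return pd.DataFrame(rows)
```

With the dataset loaded as `df`, `compare_models(df)` returns a table with one column per algorithm. The numbers will not necessarily match the reported results, since the split and settings here are assumptions.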
| Score | LinearRegression | DecisionTree | RandomForest | Ridge |
|---|---|---|---|---|
| MAE | 4305.20 | 2798.83 | 2608.55 | 4311.10 |
| RMSE | 6209.88 | 6067.50 | 4841.88 | 6238.13 |
| R2 | 0.77 | 0.78 | 0.78 | 0.86 |
| Train Accuracy | 0.74 | 1.0 | 0.97 | 0.74 |
| Test Accuracy | 0.77 | 0.78 | 0.86 | 0.77 |
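
For reference, the MAE, RMSE, and R2 rows can be computed directly from the true and predicted charges. The sketch below is a plain-Python illustration with hypothetical function names; the author presumably used library functions instead. The "Train Accuracy" and "Test Accuracy" rows are most likely R2 values too, since that is what a scikit-learn regressor's `score` method returns, but the source does not confirm this.

```python
import math


def mae(y_true, y_pred):
    """Mean absolute error: the average size of the prediction errors."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    """Root mean squared error: penalizes large errors more than MAE does."""
    squared = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    return math.sqrt(squared / len(y_true))


def r2(y_true, y_pred):
    """R2: 1 minus the residual sum of squares over the total sum of squares."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot
```

As a small check, `r2([1, 2, 3], [1, 2, 4])` evaluates to 0.5: the residual sum of squares is 1 and the total sum of squares is 2.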
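Based on the predictive modeling, Linear Regression is selected as the best algorithm, with an MAE of 4305.20, an RMSE of 6209.88, and an R2 of 0.77. Random Forest and Decision Tree report lower errors, but their train accuracy (0.97 and 1.0) is well above their test accuracy (0.86 and 0.78), which suggests overfitting. Linear Regression's train and test accuracy (0.74 and 0.77) are close, so it is judged a good fit.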
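
The train-versus-test comparison behind this conclusion can be made explicit by computing the gap between the two scores for each model. This is a small sketch using the values from the results table; the helper name `train_test_gap` is hypothetical.

```python
# (train score, test score) pairs copied from the results table.
scores = {
    "LinearRegression": (0.74, 0.77),
    "DecisionTree": (1.0, 0.78),
    "RandomForest": (0.97, 0.86),
    "Ridge": (0.74, 0.77),
}


def train_test_gap(scores):
    """Return train minus test score per model, rounded to two decimals."""
    return {name: round(train - test, 2) for name, (train, test) in scores.items()}


gaps = train_test_gap(scores)
# Print models from smallest to largest gap; a large positive gap hints at overfitting.
for name, gap in sorted(gaps.items(), key=lambda item: item[1]):
    print(f"{name}: {gap:+.2f}")
```

Decision Tree shows the largest gap (0.22) and Random Forest the next largest (0.11), while Linear Regression and Ridge score slightly higher on the test set than on the training set.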