A Data Science Project using Machine Learning & Exploratory Analysis
This project explores how climate indicators such as temperature, CO₂ emissions, precipitation, humidity, and wind speed relate to sea-level rise. It combines exploratory data analysis (EDA), feature engineering, outlier treatment, and several machine-learning models to predict sea-level variations and to examine which environmental factors influence them most strongly.
This repository includes a complete end-to-end workflow:
- Data loading
- Cleaning & feature engineering
- Exploratory data analysis
- Model training (Linear Regression, Random Forest, Decision Tree, SVR)
- Model evaluation
- A prediction interface for generating sea-level rise estimates
The dataset contains the following climate-related fields:
| Column | Description |
|---|---|
| Date | Daily timestamp |
| Location | City or locality |
| Country | Country identifier |
| Temperature | Temperature (°C) |
| CO₂ Emissions | Carbon emissions (tons/year) |
| Sea Level Rise | Rise in sea level (meters) |
| Precipitation | Precipitation (mm) |
| Humidity | Air humidity (%) |
| Wind Speed | Wind speed (m/s) |
To support time-series and seasonal analysis, the following features were extracted from the Date column:
- year_month (YYYY-MM)
- year
- month
These engineered features are used later in the model training pipeline, along with one-hot encoding of categorical fields such as Country.
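A minimal sketch of how the date features and the one-hot encoding could be produced with pandas. The tiny frame is made-up illustration data, and the exact column spellings are assumptions based on the field table, not taken from the notebook.

```python
import pandas as pd

# Made-up stand-in rows; column names are assumed to match the field table.
df = pd.DataFrame({
    "Date": ["2000-01-15", "2000-02-20", "2001-07-04"],
    "Country": ["India", "Brazil", "India"],
    "Temperature": [14.2, 18.9, 21.5],
})

# Parse the timestamp once, then derive the three date features.
df["Date"] = pd.to_datetime(df["Date"])
df["year_month"] = df["Date"].dt.strftime("%Y-%m")
df["year"] = df["Date"].dt.year
df["month"] = df["Date"].dt.month

# One-hot encode the categorical Country field (one indicator column per country).
encoded = pd.get_dummies(df, columns=["Country"], prefix="Country")
print(encoded.columns.tolist())
```

Keeping year and month as separate numeric columns lets the models use them directly, while year_month is mainly convenient for grouping in trend plots.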
Outliers were removed using the Interquartile Range (IQR) method for:
- CO₂ Emissions
- Sea Level Rise
- Temperature
A total of 218 rows were removed, improving model stability and reducing noise.
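The IQR rule can be sketched as below. The helper name `remove_iqr_outliers`, the conventional multiplier of 1.5, and the choice to compute all bounds on the unfiltered data are assumptions; the notebook may apply the filter column by column instead. The demo values are invented for illustration.

```python
import pandas as pd

def remove_iqr_outliers(df, columns, k=1.5):
    """Keep rows where every listed column lies within [Q1 - k*IQR, Q3 + k*IQR]."""
    mask = pd.Series(True, index=df.index)
    for col in columns:
        q1 = df[col].quantile(0.25)
        q3 = df[col].quantile(0.75)
        iqr = q3 - q1
        # between() is inclusive on both ends by default.
        mask &= df[col].between(q1 - k * iqr, q3 + k * iqr)
    return df[mask]

# Invented demo data with one obviously extreme temperature reading.
demo = pd.DataFrame({
    "Temperature": [14.0, 15.0, 15.5, 16.0, 14.5, 60.0],
    "Sea Level Rise": [0.01, 0.02, 0.00, -0.01, 0.03, 0.02],
})
cleaned = remove_iqr_outliers(demo, ["Temperature", "Sea Level Rise"])
print(len(demo) - len(cleaned), "rows removed")
```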
The notebook includes detailed EDA through:
- Correlation heatmaps
- Scatter plots showing relationships with sea-level rise
- Histograms & boxplots for distribution analysis
- Seasonal and temporal trend visualizations
These analyses reveal which climate indicators correlate most strongly with sea-level variations.
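As a rough illustration of the correlation step, the sketch below ranks features by the absolute value of their Pearson correlation with the target. The data is synthetic and constructed so that temperature drives the target; it says nothing about the real dataset's correlations.

```python
import numpy as np
import pandas as pd

# Synthetic data only: the target is built from temperature plus noise.
rng = np.random.default_rng(0)
n = 200
temperature = rng.normal(15, 5, n)
demo = pd.DataFrame({
    "Temperature": temperature,
    "Humidity": rng.uniform(20, 90, n),
    "Sea Level Rise": 0.01 * temperature + rng.normal(0, 0.01, n),
})

# Pairwise Pearson correlations between all numeric columns.
corr = demo.corr()

# Rank the features by strength of association with the target.
ranked = (
    corr["Sea Level Rise"]
    .drop("Sea Level Rise")
    .abs()
    .sort_values(ascending=False)
)
print(ranked)
```

The same `corr` matrix can be passed to `seaborn.heatmap(corr, annot=True)` to draw the heatmap.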
A variety of regression models were trained and evaluated:
- Linear Regression: a baseline model for detecting linear relationships.
- Random Forest: an ensemble learning method capable of modeling complex non-linear interactions.
- Decision Tree: a simple, interpretable tree-based model using recursive splitting.
- SVR (Support Vector Regression): a margin-based model that can capture non-linear relationships through kernels and requires feature scaling.
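A minimal sketch of training the four regressors with scikit-learn. The feature matrix here is synthetic, and the hyperparameters, split ratio, and random seeds are assumptions rather than the notebook's actual settings. SVR is wrapped in a pipeline so that scaling is fitted on the training split only.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for the engineered feature matrix and target.
rng = np.random.default_rng(42)
X = rng.normal(size=(300, 5))
y = 0.5 * X[:, 0] - 0.2 * X[:, 1] + rng.normal(scale=0.1, size=300)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

models = {
    "Linear Regression": LinearRegression(),
    "Random Forest": RandomForestRegressor(n_estimators=100, random_state=42),
    "Decision Tree": DecisionTreeRegressor(random_state=42),
    # Scaling lives inside the pipeline, so it is learned from X_train only.
    "SVR": make_pipeline(StandardScaler(), SVR(kernel="rbf")),
}

for name, model in models.items():
    model.fit(X_train, y_train)
    # score() returns R^2 on the held-out split for regressors.
    print(f"{name}: test R^2 = {model.score(X_test, y_test):.3f}")
```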
Each model is assessed using:
- Mean Squared Error (MSE)
- R² Score
- Actual vs. Predicted Scatter Plots
- Actual vs. Predicted Trend Plots
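For reference, the two numeric metrics can be computed as sketched below. The arrays are invented values, not results from the notebook; the manual formulas are included only to show what the scikit-learn helpers calculate.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Invented actual and predicted values, for illustration only.
y_true = np.array([0.10, 0.20, 0.30, 0.40])
y_pred = np.array([0.12, 0.18, 0.33, 0.37])

mse = mean_squared_error(y_true, y_pred)
r2 = r2_score(y_true, y_pred)

# The same quantities written out by hand.
residual_ss = np.sum((y_true - y_pred) ** 2)
total_ss = np.sum((y_true - y_true.mean()) ** 2)
mse_manual = residual_ss / len(y_true)   # mean of squared errors
r2_manual = 1 - residual_ss / total_ss   # share of variance explained

print(f"MSE = {mse:.5f}, R^2 = {r2:.3f}")
```

A lower MSE and an R² closer to 1 indicate a better fit; R² can be negative when a model does worse than predicting the mean.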
The notebook includes an interactive prediction system that accepts user input for:
- Temperature
- CO₂ emissions
- Precipitation
- Humidity
- Wind speed
- Year
- Month
- Country
These inputs go through the same preprocessing pipeline used during training (including scaling and encoding), so each prediction is made on features transformed consistently with the training data.
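One way to guarantee that user inputs receive the training-time preprocessing is to bundle the transformers and the model in a single scikit-learn `Pipeline`. The sketch below is an assumption about structure, not the notebook's code: the training data is random noise, the column spellings are guessed from the field table, and `predict_sea_level` is a hypothetical helper name.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

numeric = ["Temperature", "CO2 Emissions", "Precipitation",
           "Humidity", "Wind Speed", "year", "month"]
categorical = ["Country"]

# Random stand-in training data; values carry no real-world meaning.
rng = np.random.default_rng(1)
n = 120
train = pd.DataFrame({
    "Temperature": rng.normal(15, 5, n),
    "CO2 Emissions": rng.normal(400, 50, n),
    "Precipitation": rng.uniform(0, 100, n),
    "Humidity": rng.uniform(0, 100, n),
    "Wind Speed": rng.uniform(0, 50, n),
    "year": rng.integers(2000, 2023, n),
    "month": rng.integers(1, 13, n),
    "Country": rng.choice(["India", "Brazil", "Kenya"], n),
})
target = rng.normal(0, 1, n)

# Scaling and encoding are fitted once, together with the model.
preprocess = ColumnTransformer([
    ("num", StandardScaler(), numeric),
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical),
])
pipeline = Pipeline([
    ("prep", preprocess),
    ("model", RandomForestRegressor(n_estimators=50, random_state=0)),
])
pipeline.fit(train, target)

def predict_sea_level(inputs):
    """Hypothetical helper: predict from one dict of user-supplied values."""
    row = pd.DataFrame([inputs])[numeric + categorical]
    return float(pipeline.predict(row)[0])

estimate = predict_sea_level({
    "Temperature": 16.5, "CO2 Emissions": 410.0, "Precipitation": 42.0,
    "Humidity": 55.0, "Wind Speed": 12.0, "year": 2024, "month": 6,
    "Country": "India",
})
print(estimate)
```

With `handle_unknown="ignore"`, a country that was not seen during training is encoded as all zeros instead of raising an error.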
Each model's predicted sea-level rise (meters) for one example set of inputs:

| Model | Predicted Sea Level Rise |
|---|---|
| Linear Regression | 0.08829 |
| Random Forest | 0.12365 |
| Decision Tree | 0.53591 |
| SVR | -0.06596 |
- Outlier removal improved distribution smoothness and reduced skewness.
- Random Forest performed best overall in accuracy and robustness.
- SVR predictions varied significantly, showing sensitivity to scaling and kernel parameters.
- Using multiple models provides a more comprehensive understanding of possible sea-level rise outcomes.
Potential enhancements include:
- Hyperparameter tuning (GridSearchCV / RandomizedSearchCV)
- Deep learning models (LSTM, GRU) for time-series prediction
- GIS or geographical visualization dashboards
- Country-level climate trend analytics
- Feature importance analysis (SHAP, permutation importance)
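As a starting point for two of these enhancements, the sketch below combines a small grid search with permutation importance. The data is synthetic (only the first feature carries signal), and the parameter grid is an arbitrary assumption chosen to keep the example fast.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import GridSearchCV

# Synthetic data: the target depends on feature 0 only.
rng = np.random.default_rng(7)
X = rng.normal(size=(200, 3))
y = 2.0 * X[:, 0] + rng.normal(scale=0.1, size=200)

# Exhaustive search over a deliberately tiny grid with 3-fold CV.
search = GridSearchCV(
    RandomForestRegressor(random_state=0),
    param_grid={"n_estimators": [25, 50], "max_depth": [3, None]},
    scoring="neg_mean_squared_error",
    cv=3,
)
search.fit(X, y)
print("best params:", search.best_params_)

# Permutation importance: the score drop when one feature is shuffled.
result = permutation_importance(
    search.best_estimator_, X, y, n_repeats=5, random_state=0
)
print("mean importances:", result.importances_mean)
```

For an unbiased view, permutation importance is better computed on a held-out split than on the data used for fitting, as done here for brevity.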
├── climate_change_data.csv # Dataset
├── Climate_Analysis.ipynb # Main notebook (EDA, models, predictions)
├── README.md # Project documentation
- Python 3.10+
- pandas
- numpy
- seaborn / matplotlib
- scikit-learn
- wordcloud
- Jupyter Notebook / Google Colab
This project aims to deepen understanding of the global climate crisis using data-driven methods. Special thanks to the providers of open climate datasets and the open-source community.