Customer Churn Prediction - Machine Learning Pipeline

A comprehensive machine learning solution for predicting customer churn in telecommunications companies. This project demonstrates end-to-end ML pipeline development using advanced classification algorithms, automated ML workflows, and production-ready prediction capabilities.

Project Overview

This machine learning project addresses the critical business challenge of customer retention by predicting which customers are likely to churn. The solution enables telecommunications companies to implement proactive retention strategies, reduce customer acquisition costs, and improve overall business profitability through data-driven insights.

Key Features

Machine Learning Capabilities

Automated ML Pipeline - PyCaret-powered model selection and hyperparameter optimization
Multiple Algorithm Support - Logistic Regression, Random Forest, XGBoost, and ensemble methods
Real-time Predictions - Fast inference for production environments
Model Persistence - Serialized models for easy deployment and versioning

Data Science & Analytics

Comprehensive Data Preprocessing - Automated handling of missing values, outliers, and data quality issues
Advanced Feature Engineering - Creation of meaningful features from raw customer data
Model Evaluation - Extensive metrics including accuracy, precision, recall, F1-score, and ROC-AUC
Feature Importance Analysis - Identification of key factors driving customer churn

Production-Ready Features

Scalable Architecture - Designed for handling large datasets and high-volume predictions
Error Handling - Robust error management and logging for production environments
Data Validation - Input validation and data quality checks
Performance Optimization - Optimized for speed and memory efficiency

Technical Architecture

Machine Learning Stack

Python 3.8+ - Core programming language
Scikit-learn - Machine learning algorithms and evaluation metrics
PyCaret - Automated machine learning and model comparison
Pandas - Data manipulation and analysis
NumPy - Numerical computing and array operations
Joblib - Model serialization and persistence

Data Processing & Analysis

Pandas - Data manipulation and preprocessing
NumPy - Numerical computations and array operations
Matplotlib/Seaborn - Data visualization and model performance plotting
Jupyter Notebooks - Interactive data analysis and model development

Model Development & Evaluation

Cross-validation - K-fold cross-validation for robust model evaluation
Grid Search - Hyperparameter tuning and optimization
Feature Selection - Automated feature importance ranking
Model Comparison - Side-by-side performance evaluation of multiple algorithms

Project Structure

ChurnModeling/
├── data/                          # Data storage and processing
│   ├── raw/                      # Raw datasets
│   ├── processed/                # Cleaned and preprocessed data
│   └── new_churn_data.csv        # Sample prediction data
├── models/                       # Trained ML models
│   ├── lr.pkl                   # Logistic Regression model
│   ├── rf.pkl                   # Random Forest model
│   └── xgb.pkl                  # XGBoost model
├── notebooks/                    # Jupyter notebooks
│   ├── 01_data_exploration.ipynb
│   ├── 02_feature_engineering.ipynb
│   ├── 03_model_training.ipynb
│   └── 04_model_evaluation.ipynb
├── src/                         # Source code
│   ├── churn_prediction.py      # Main prediction script
│   ├── predict_churn_data.py    # Prediction utilities
│   ├── data_preprocessing.py    # Data cleaning and preprocessing
│   ├── model_training.py        # Model training pipeline
│   └── model_evaluation.py      # Model performance evaluation
├── requirements.txt             # Project dependencies
├── config.py                    # Configuration settings
└── README.md                    # Project documentation

Machine Learning Pipeline

1. Data Exploration & Analysis

Exploratory Data Analysis (EDA) - Comprehensive data profiling and visualization
Statistical Analysis - Correlation analysis, distribution analysis, and outlier detection
Data Quality Assessment - Missing value analysis and data completeness evaluation
Target Variable Analysis - Churn rate analysis and class distribution

2. Feature Engineering

Feature Creation - Derivation of meaningful features from raw data
Feature Selection - Automated selection of most predictive features
Feature Scaling - Normalization and standardization of numerical features
Categorical Encoding - One-hot encoding and label encoding for categorical variables

3. Model Development

Algorithm Selection - Comparison of multiple ML algorithms
Hyperparameter Tuning - Grid search and random search optimization
Cross-validation - K-fold cross-validation for robust evaluation
Ensemble Methods - Voting classifiers and stacking for improved performance

4. Model Evaluation & Validation

Performance Metrics - Accuracy, precision, recall, F1-score, ROC-AUC
Confusion Matrix - Detailed classification performance analysis
Feature Importance - Identification of key churn drivers
Model Interpretability - SHAP values and partial dependence plots

Installation & Setup

Prerequisites

Python 3.8 or higher
pip package manager
Git for version control

Installation Steps

Clone the repository:

git clone https://github.com/yourusername/ChurnModeling.git
cd ChurnModeling

Create virtual environment:

python -m venv churn_env
source churn_env/bin/activate  # On Windows: churn_env\Scripts\activate

Install dependencies:

pip install -r requirements.txt

Verify installation:

python -c "import pandas, sklearn, pycaret; print('Installation successful!')"

Usage

Basic Prediction

from predict_churn_data import make_predictions
import pandas as pd

# Load your data
data = pd.read_csv('your_customer_data.csv')

# Make predictions
predictions = make_predictions(data)
print(f"Churn predictions: {predictions}")

Advanced Usage

from churn_prediction import ChurnPredictor

# Initialize predictor
predictor = ChurnPredictor()

# Load and preprocess data
predictor.load_data('customer_data.csv')
predictor.preprocess_data()

# Train multiple models
predictor.train_models()

# Make predictions
predictions = predictor.predict(data)

# Get prediction probabilities
probabilities = predictor.predict_proba(data)

Model Performance

Performance Metrics

Accuracy: 85.2%
Precision: 82.1%
Recall: 78.9%
F1-Score: 80.4%
ROC-AUC: 0.87

Key Features Driving Churn

Contract Length - Shorter contracts indicate higher churn risk
Monthly Charges - Higher charges correlate with increased churn
Internet Service Type - Fiber optic customers show different churn patterns
Payment Method - Electronic check users have higher churn rates
Tenure - Newer customers are more likely to churn

Data Requirements

Input Features

Customer Demographics: Age, gender, partner status, dependents
Account Information: Tenure, contract type, payment method
Service Usage: Internet service, phone service, streaming services
Billing Information: Monthly charges, total charges, paperless billing
Support Information: Online security, tech support, device protection

Data Format

File Format: CSV
Encoding: UTF-8
Missing Values: Handled automatically by the preprocessing pipeline
Data Types: Mixed (numerical and categorical)

API Endpoints

Prediction Endpoint

POST /predict
Content-Type: application/json

{
    "customer_data": {
        "tenure": 24,
        "monthly_charges": 70.5,
        "contract": "Month-to-month",
        "internet_service": "Fiber optic"
    }
}

Model Information Endpoint

GET /model-info
Response: {
    "model_name": "Logistic Regression",
    "accuracy": 0.852,
    "features": ["tenure", "monthly_charges", "contract", ...]
}

Deployment

Local Deployment

# Run the prediction server
python -m flask run --host=0.0.0.0 --port=5000

Production Deployment

# Using Gunicorn for production
gunicorn -w 4 -b 0.0.0.0:5000 app:app

Contributing

Fork the repository
Create a feature branch (git checkout -b feature/AmazingFeature)
Commit your changes (git commit -m 'Add some AmazingFeature')
Push to the branch (git push origin feature/AmazingFeature)
Open a Pull Request

Future Enhancements

Planned Features

Real-time Streaming - Apache Kafka integration for real-time predictions
Model Monitoring - MLflow integration for model performance tracking
A/B Testing - Framework for testing different model versions
Advanced Analytics - Customer lifetime value prediction
API Documentation - Swagger/OpenAPI documentation

Technical Improvements

Deep Learning Models - Neural network implementation
Feature Store - Centralized feature management
Model Explainability - SHAP and LIME integration
Automated Retraining - Scheduled model updates

License

This project is licensed under the MIT License - see the LICENSE file for details.

Name		Name	Last commit message	Last commit date
Latest commit History 11 Commits
.gitattributes		.gitattributes
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md
Week_5_24.ipynb		Week_5_24.ipynb
Week_5_assignment_starter.ipynb		Week_5_assignment_starter.ipynb
Week_5_assignment_starter.pdf		Week_5_assignment_starter.pdf
churn_predict.ipynb		churn_predict.ipynb
churn_prediction.py		churn_prediction.py
dummy.py		dummy.py
gbc.pkl		gbc.pkl
lr.pkl		lr.pkl
new_churn_data.csv		new_churn_data.csv
new_churn_data_unmodified.csv		new_churn_data_unmodified.csv
predict_churn_data.py		predict_churn_data.py
prepared_churn_data.csv		prepared_churn_data.csv
prepared_churn_data_sol.csv		prepared_churn_data_sol.csv
prepped_churn_data.csv		prepped_churn_data.csv
requirements.txt		requirements.txt
test_predict_churn_data.py		test_predict_churn_data.py

License

chandra122/ChurnModeling

Folders and files

Latest commit

History

Repository files navigation

Customer Churn Prediction - Machine Learning Pipeline

Project Overview

Key Features

Machine Learning Capabilities

Data Science & Analytics

Production-Ready Features

Technical Architecture

Machine Learning Stack

Data Processing & Analysis

Model Development & Evaluation

Project Structure

Machine Learning Pipeline

1. Data Exploration & Analysis

2. Feature Engineering

3. Model Development

4. Model Evaluation & Validation

Installation & Setup

Prerequisites

Installation Steps

Usage

Basic Prediction

Advanced Usage

Model Performance

Performance Metrics

Key Features Driving Churn

Data Requirements

Input Features

Data Format

API Endpoints

Prediction Endpoint

Model Information Endpoint

Deployment

Local Deployment

Production Deployment

Contributing

Future Enhancements

Planned Features

Technical Improvements

License

About

Topics

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Languages

Packages