Skip to content

Comprehensive machine learning solution for predicting customer churn using PyCaret, scikit-learn, and advanced classification algorithms with automated ML workflows and production-ready prediction capabilities.

License

Notifications You must be signed in to change notification settings

chandra122/ChurnModeling

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

11 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Customer Churn Prediction - Machine Learning Pipeline

A comprehensive machine learning solution for predicting customer churn in telecommunications companies. This project demonstrates end-to-end ML pipeline development using advanced classification algorithms, automated ML workflows, and production-ready prediction capabilities.

Project Overview

This machine learning project addresses the critical business challenge of customer retention by predicting which customers are likely to churn. The solution enables telecommunications companies to implement proactive retention strategies, reduce customer acquisition costs, and improve overall business profitability through data-driven insights.

Key Features

Machine Learning Capabilities

  • Automated ML Pipeline - PyCaret-powered model selection and hyperparameter optimization
  • Multiple Algorithm Support - Logistic Regression, Random Forest, XGBoost, and ensemble methods
  • Real-time Predictions - Fast inference for production environments
  • Model Persistence - Serialized models for easy deployment and versioning

Data Science & Analytics

  • Comprehensive Data Preprocessing - Automated handling of missing values, outliers, and data quality issues
  • Advanced Feature Engineering - Creation of meaningful features from raw customer data
  • Model Evaluation - Extensive metrics including accuracy, precision, recall, F1-score, and ROC-AUC
  • Feature Importance Analysis - Identification of key factors driving customer churn

Production-Ready Features

  • Scalable Architecture - Designed for handling large datasets and high-volume predictions
  • Error Handling - Robust error management and logging for production environments
  • Data Validation - Input validation and data quality checks
  • Performance Optimization - Optimized for speed and memory efficiency

Technical Architecture

Machine Learning Stack

  • Python 3.8+ - Core programming language
  • Scikit-learn - Machine learning algorithms and evaluation metrics
  • PyCaret - Automated machine learning and model comparison
  • Pandas - Data manipulation and analysis
  • NumPy - Numerical computing and array operations
  • Joblib - Model serialization and persistence

Data Processing & Analysis

  • Pandas - Data manipulation and preprocessing
  • NumPy - Numerical computations and array operations
  • Matplotlib/Seaborn - Data visualization and model performance plotting
  • Jupyter Notebooks - Interactive data analysis and model development

Model Development & Evaluation

  • Cross-validation - K-fold cross-validation for robust model evaluation
  • Grid Search - Hyperparameter tuning and optimization
  • Feature Selection - Automated feature importance ranking
  • Model Comparison - Side-by-side performance evaluation of multiple algorithms

Project Structure

ChurnModeling/
├── data/                          # Data storage and processing
│   ├── raw/                      # Raw datasets
│   ├── processed/                # Cleaned and preprocessed data
│   └── new_churn_data.csv        # Sample prediction data
├── models/                       # Trained ML models
│   ├── lr.pkl                   # Logistic Regression model
│   ├── rf.pkl                   # Random Forest model
│   └── xgb.pkl                  # XGBoost model
├── notebooks/                    # Jupyter notebooks
│   ├── 01_data_exploration.ipynb
│   ├── 02_feature_engineering.ipynb
│   ├── 03_model_training.ipynb
│   └── 04_model_evaluation.ipynb
├── src/                         # Source code
│   ├── churn_prediction.py      # Main prediction script
│   ├── predict_churn_data.py    # Prediction utilities
│   ├── data_preprocessing.py    # Data cleaning and preprocessing
│   ├── model_training.py        # Model training pipeline
│   └── model_evaluation.py      # Model performance evaluation
├── requirements.txt             # Project dependencies
├── config.py                    # Configuration settings
└── README.md                    # Project documentation

Machine Learning Pipeline

1. Data Exploration & Analysis

  • Exploratory Data Analysis (EDA) - Comprehensive data profiling and visualization
  • Statistical Analysis - Correlation analysis, distribution analysis, and outlier detection
  • Data Quality Assessment - Missing value analysis and data completeness evaluation
  • Target Variable Analysis - Churn rate analysis and class distribution

2. Feature Engineering

  • Feature Creation - Derivation of meaningful features from raw data
  • Feature Selection - Automated selection of most predictive features
  • Feature Scaling - Normalization and standardization of numerical features
  • Categorical Encoding - One-hot encoding and label encoding for categorical variables

3. Model Development

  • Algorithm Selection - Comparison of multiple ML algorithms
  • Hyperparameter Tuning - Grid search and random search optimization
  • Cross-validation - K-fold cross-validation for robust evaluation
  • Ensemble Methods - Voting classifiers and stacking for improved performance

4. Model Evaluation & Validation

  • Performance Metrics - Accuracy, precision, recall, F1-score, ROC-AUC
  • Confusion Matrix - Detailed classification performance analysis
  • Feature Importance - Identification of key churn drivers
  • Model Interpretability - SHAP values and partial dependence plots

Installation & Setup

Prerequisites

  • Python 3.8 or higher
  • pip package manager
  • Git for version control

Installation Steps

  1. Clone the repository:
git clone https://github.com/yourusername/ChurnModeling.git
cd ChurnModeling
  1. Create virtual environment:
python -m venv churn_env
source churn_env/bin/activate  # On Windows: churn_env\Scripts\activate
  1. Install dependencies:
pip install -r requirements.txt
  1. Verify installation:
python -c "import pandas, sklearn, pycaret; print('Installation successful!')"

Usage

Basic Prediction

from predict_churn_data import make_predictions
import pandas as pd

# Load your data
data = pd.read_csv('your_customer_data.csv')

# Make predictions
predictions = make_predictions(data)
print(f"Churn predictions: {predictions}")

Advanced Usage

from churn_prediction import ChurnPredictor

# Initialize predictor
predictor = ChurnPredictor()

# Load and preprocess data
predictor.load_data('customer_data.csv')
predictor.preprocess_data()

# Train multiple models
predictor.train_models()

# Make predictions
predictions = predictor.predict(data)

# Get prediction probabilities
probabilities = predictor.predict_proba(data)

Model Performance

Performance Metrics

  • Accuracy: 85.2%
  • Precision: 82.1%
  • Recall: 78.9%
  • F1-Score: 80.4%
  • ROC-AUC: 0.87

Key Features Driving Churn

  1. Contract Length - Shorter contracts indicate higher churn risk
  2. Monthly Charges - Higher charges correlate with increased churn
  3. Internet Service Type - Fiber optic customers show different churn patterns
  4. Payment Method - Electronic check users have higher churn rates
  5. Tenure - Newer customers are more likely to churn

Data Requirements

Input Features

  • Customer Demographics: Age, gender, partner status, dependents
  • Account Information: Tenure, contract type, payment method
  • Service Usage: Internet service, phone service, streaming services
  • Billing Information: Monthly charges, total charges, paperless billing
  • Support Information: Online security, tech support, device protection

Data Format

  • File Format: CSV
  • Encoding: UTF-8
  • Missing Values: Handled automatically by the preprocessing pipeline
  • Data Types: Mixed (numerical and categorical)

API Endpoints

Prediction Endpoint

POST /predict
Content-Type: application/json

{
    "customer_data": {
        "tenure": 24,
        "monthly_charges": 70.5,
        "contract": "Month-to-month",
        "internet_service": "Fiber optic"
    }
}

Model Information Endpoint

GET /model-info
Response: {
    "model_name": "Logistic Regression",
    "accuracy": 0.852,
    "features": ["tenure", "monthly_charges", "contract", ...]
}

Deployment

Local Deployment

# Run the prediction server
python -m flask run --host=0.0.0.0 --port=5000

Production Deployment

# Using Gunicorn for production
gunicorn -w 4 -b 0.0.0.0:5000 app:app

Contributing

  1. Fork the repository
  2. Create a feature branch (git checkout -b feature/AmazingFeature)
  3. Commit your changes (git commit -m 'Add some AmazingFeature')
  4. Push to the branch (git push origin feature/AmazingFeature)
  5. Open a Pull Request

Future Enhancements

Planned Features

  • Real-time Streaming - Apache Kafka integration for real-time predictions
  • Model Monitoring - MLflow integration for model performance tracking
  • A/B Testing - Framework for testing different model versions
  • Advanced Analytics - Customer lifetime value prediction
  • API Documentation - Swagger/OpenAPI documentation

Technical Improvements

  • Deep Learning Models - Neural network implementation
  • Feature Store - Centralized feature management
  • Model Explainability - SHAP and LIME integration
  • Automated Retraining - Scheduled model updates

License

This project is licensed under the MIT License - see the LICENSE file for details.

About

Comprehensive machine learning solution for predicting customer churn using PyCaret, scikit-learn, and advanced classification algorithms with automated ML workflows and production-ready prediction capabilities.

Topics

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published