This project implements a Machine Learning pipeline using Apache Spark (PySpark) and MLlib to predict customer churn in the telecommunications industry. The goal is to identify customers likely to stop using the service based on their behavioral and demographic data, allowing businesses to take proactive retention measures.
Customer churn is a critical metric for business sustainability. This solution processes customer data to build and compare classification models. The project covers the entire data science lifecycle, including data cleaning, feature engineering with Spark pipelines, model training, and evaluation using industry-standard metrics.
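The lifecycle described above can be sketched roughly as follows. This is a minimal illustration, not the contents of churn_project.py: the column names are assumed from the public Telco Customer Churn CSV, the feature subset is arbitrary, and the helper names `build_churn_pipeline` and `main` are hypothetical.

```python
# Column names are assumed from the public Telco Customer Churn CSV;
# adjust them if your copy of the file differs.
CATEGORICAL_COLS = ["Contract", "InternetService", "PaymentMethod"]
NUMERIC_COLS = ["tenure", "MonthlyCharges", "TotalCharges"]
LABEL_COL = "Churn"


def build_churn_pipeline(classifier):
    """Chain indexing, one-hot encoding, vector assembly, and a classifier.

    Needs an active SparkSession, because MLlib stages are backed by JVM objects.
    """
    # Imported lazily so the column lists can be inspected without Spark installed.
    from pyspark.ml import Pipeline
    from pyspark.ml.feature import OneHotEncoder, StringIndexer, VectorAssembler

    indexed = [f"{c}_idx" for c in CATEGORICAL_COLS]
    encoded = [f"{c}_vec" for c in CATEGORICAL_COLS]
    stages = [
        StringIndexer(inputCols=CATEGORICAL_COLS, outputCols=indexed, handleInvalid="keep"),
        StringIndexer(inputCol=LABEL_COL, outputCol="label"),
        OneHotEncoder(inputCols=indexed, outputCols=encoded),
        VectorAssembler(inputCols=encoded + NUMERIC_COLS, outputCol="features"),
        classifier,
    ]
    return Pipeline(stages=stages)


def main(csv_path="Telco-Customer-Churn.csv"):
    """Clean the data, fit one model on 80% of it, and print test metrics."""
    from pyspark.ml.classification import LogisticRegression
    from pyspark.ml.evaluation import MulticlassClassificationEvaluator
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("churn-sketch").getOrCreate()
    df = spark.read.csv(csv_path, header=True, inferSchema=True)

    # Assumption: TotalCharges may contain blank entries, so cast it to a
    # number and drop rows where any numeric feature ends up null.
    df = df.withColumn("TotalCharges", F.col("TotalCharges").cast("double"))
    df = df.dropna(subset=NUMERIC_COLS)

    train, test = df.randomSplit([0.8, 0.2], seed=42)
    classifier = LogisticRegression(featuresCol="features", labelCol="label")
    model = build_churn_pipeline(classifier).fit(train)
    predictions = model.transform(test)

    for metric in ("accuracy", "f1"):
        evaluator = MulticlassClassificationEvaluator(
            labelCol="label", predictionCol="prediction", metricName=metric
        )
        print(metric, evaluator.evaluate(predictions))
    spark.stop()
```

Calling `main()` from the project root would run the sketch end to end. Passing a `RandomForestClassifier` to `build_churn_pipeline` instead would give the second model for comparison.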
The analysis uses the Telco Customer Churn dataset. To keep the repository small, the data file is not included.
Source: Telco Customer Churn on Kaggle
- Download the dataset from the link above.
- Rename the file to Telco-Customer-Churn.csv (if necessary).
- Place the file in the root directory of this project (same folder as churn_project.py).
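A small check like the one below can confirm that the file is where the steps above put it before Spark starts. It is a sketch only; the helper name `find_dataset` is hypothetical and not part of churn_project.py.

```python
from pathlib import Path

EXPECTED_NAME = "Telco-Customer-Churn.csv"


def find_dataset(project_root="."):
    """Return the dataset path, or raise a helpful error if the file is missing."""
    path = Path(project_root) / EXPECTED_NAME
    if not path.is_file():
        raise FileNotFoundError(
            f"{EXPECTED_NAME} not found in {Path(project_root).resolve()}; "
            "download it from Kaggle and place it next to churn_project.py."
        )
    return path
```

Failing early with a clear message is usually easier to diagnose than the longer stack trace Spark produces when the input path does not exist.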
The project was developed in Visual Studio Code on Windows. A Python virtual environment was used to isolate dependencies and keep the setup reproducible. Data processing runs on PySpark, with Eclipse Temurin JDK 17 as the Java runtime.
- Python 3.11 or newer.
- Java 8 or 17 (JDK) is required for Apache Spark to run.
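The two prerequisites above can be checked from Python before installing anything else. This is a minimal sketch with a hypothetical helper name, `check_prerequisites`; it only tests that a `java` executable is on the PATH and does not verify the Java version.

```python
import shutil
import sys


def check_prerequisites(min_python=(3, 11)):
    """Report whether Python is new enough and a `java` executable is on PATH."""
    return {
        "python_ok": sys.version_info >= min_python,
        "java_on_path": shutil.which("java") is not None,
    }
```

If `java_on_path` is false, Spark will typically fail at session start-up, so it is worth fixing the JDK installation or PATH first.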
Follow these steps to set up the project locally:
- Clone the repository:
    git clone <repository_url>
    cd <repository_name>
- Create and activate the virtual environment (Windows):
    python -m venv .venv
    .\.venv\Scripts\Activate.ps1
- Install the required Python libraries:
    pip install pyspark pandas numpy matplotlib scikit-learn
Once the dependencies are installed and the CSV file is placed in the root folder, execute the analysis script via the terminal:
python churn_project.py
Results
The models were evaluated on a held-out test set (20% of the data). In this run, Logistic Regression slightly outperformed Random Forest on both accuracy and F1 score.
| Model | Accuracy | F1 Score |
|---|---|---|
| Logistic Regression | 80.77% | 80.14% |
| Random Forest | 80.03% | 78.56% |
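For readers unfamiliar with the two metrics, the sketch below computes them in plain Python. It assumes the reported F1 is the support-weighted F1 that MLlib's `MulticlassClassificationEvaluator` returns for `metricName="f1"`; the source does not state which variant was used. The function names are illustrative.

```python
from collections import Counter


def accuracy(labels, preds):
    """Share of predictions that match the true label."""
    correct = sum(1 for y, p in zip(labels, preds) if y == p)
    return correct / len(labels)


def weighted_f1(labels, preds):
    """Per-class F1 scores averaged with each class's share of true labels as weight."""
    support = Counter(labels)
    total = len(labels)
    score = 0.0
    for cls, n in support.items():
        tp = sum(1 for y, p in zip(labels, preds) if y == cls and p == cls)
        fp = sum(1 for y, p in zip(labels, preds) if y != cls and p == cls)
        fn = n - tp
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        score += (n / total) * f1
    return score
```

Because churn datasets are usually imbalanced, accuracy alone can look good for a model that rarely predicts churn, which is one reason to report an F1 score alongside it.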
Key Insights
The feature importance analysis suggests that:
- Contract Type is the most significant predictor of churn; customers with month-to-month contracts are at the highest risk.
- Tenure is crucial; new customers are far more likely to leave than long-term ones.
- Fiber Optic internet users show a higher churn rate compared to DSL users.
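Claims like these can be sanity-checked with a simple per-group churn rate, independent of any model. The sketch below is not the feature importance analysis used in the project; it is a complementary check that assumes the column names `Contract` and `Churn` and the label value `Yes` from the public dataset. The toy rows are invented for illustration and are not real data.

```python
import csv
import io
from collections import defaultdict


def churn_rate_by(rows, column, label_col="Churn", positive="Yes"):
    """Return {category: share of rows whose label equals `positive`} for one column."""
    totals = defaultdict(int)
    churned = defaultdict(int)
    for row in rows:
        key = row[column]
        totals[key] += 1
        if row[label_col] == positive:
            churned[key] += 1
    return {key: churned[key] / totals[key] for key in totals}


# Toy rows for illustration only; these are NOT taken from the real dataset.
TOY_CSV = """Contract,Churn
Month-to-month,Yes
Month-to-month,No
Two year,No
Two year,No
"""
toy_rows = list(csv.DictReader(io.StringIO(TOY_CSV)))
print(churn_rate_by(toy_rows, "Contract"))
```

To run it on the real file, replace `toy_rows` with `csv.DictReader(open("Telco-Customer-Churn.csv", newline=""))` and try `"InternetService"` as the column as well.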