Telco Customer Churn Prediction with PySpark

This project implements a Machine Learning pipeline using Apache Spark (PySpark) and MLlib to predict customer churn in the telecommunications industry. The goal is to identify customers likely to stop using the service based on their behavioral and demographic data, allowing businesses to take proactive retention measures.

📊 Project Overview

Customer churn is a critical metric for business sustainability. This solution processes customer data to build and compare classification models. The project covers the entire data science lifecycle, including data cleaning, feature engineering with Spark pipelines, model training, and evaluation using industry-standard metrics.

📂 Dataset

The analysis uses the Telco Customer Churn dataset. To keep the repository small, the data file is not included and must be downloaded separately.

Source: Telco Customer Churn on Kaggle

How to use the dataset:

Download the dataset from the link above.
Rename the file to Telco-Customer-Churn.csv (if necessary).
Place the file in the root directory of this project (same folder as churn_project.py).
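
Before launching Spark, it can help to confirm that the file from the steps above is where the script expects it. The snippet below is a small standard-library sketch; the helper name `read_header` is hypothetical and is not part of churn_project.py.

```python
import csv
from pathlib import Path

EXPECTED_NAME = "Telco-Customer-Churn.csv"


def read_header(project_root: str = ".") -> list:
    """Return the CSV column names, or raise if the file is not in place."""
    path = Path(project_root) / EXPECTED_NAME
    if not path.is_file():
        raise FileNotFoundError(
            f"Expected {EXPECTED_NAME} in {path.parent.resolve()}"
        )
    # utf-8-sig strips a byte-order mark if the download contains one.
    with path.open(newline="", encoding="utf-8-sig") as handle:
        return next(csv.reader(handle), [])
```

Calling `read_header()` from the project root prints nothing on its own; it either returns the list of column names or fails with a message naming the folder that was checked.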

🛠️ Development Environment

The project was developed in Visual Studio Code on Windows. A Python virtual environment is used for reproducibility and dependency isolation. Data processing runs on PySpark, with Eclipse Temurin JDK 17 as the Java runtime.

Prerequisites

Python 3.11 or newer.
Java 8 or 17 (JDK) is required for Apache Spark to run.
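
The two prerequisites above can be checked from Python before installing anything else. This is a rough sketch with a hypothetical helper name; it only reports whether a `java` executable or a `JAVA_HOME` variable is visible and does not verify the Java version.

```python
import os
import shutil
import sys


def check_prerequisites(min_python=(3, 11)) -> dict:
    """Report whether the interpreter and a Java installation look usable."""
    return {
        # Compare the running interpreter against the minimum version.
        "python_ok": sys.version_info >= min_python,
        # Spark needs to find Java, either on PATH or through JAVA_HOME.
        "java_on_path": shutil.which("java") is not None,
        "java_home_set": bool(os.environ.get("JAVA_HOME")),
    }
```

If `python_ok` is false, or both Java entries are false, the later PySpark steps are likely to fail at startup.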

Installation and Setup

Follow these steps to set up the project locally:

Clone the repository:

```
git clone <repository_url>
cd <repository_name>
```

Create and activate the virtual environment (Windows):
```
python -m venv .venv
.\.venv\Scripts\Activate.ps1
```

Install the required Python libraries:

```
pip install pyspark pandas numpy matplotlib scikit-learn
```

🚀 How to Run

Once the dependencies are installed and the CSV file is placed in the root folder, execute the analysis script via the terminal:

```
python churn_project.py
```

📈 Results

The models were evaluated on a test set (20% of the data), with Logistic Regression yielding slightly better performance for this specific iteration.

| Model | Accuracy | F1 Score |
| --- | --- | --- |
| Logistic Regression | 80.77% | 80.14% |
| Random Forest | 80.03% | 78.56% |
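
The README does not state which F1 variant is reported. Assuming it is the support-weighted F1 that MLlib's `MulticlassClassificationEvaluator` computes by default, the pure-Python sketch below shows how that number and accuracy are derived from labels and predictions, and why the two can differ. The function names and the toy labels are illustrative only.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def weighted_f1(y_true, y_pred):
    """Per-class F1, averaged with each class weighted by its true support."""
    support = Counter(y_true)
    total = 0.0
    for label, count in support.items():
        true_pos = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        predicted = sum(p == label for p in y_pred)
        precision = true_pos / predicted if predicted else 0.0
        recall = true_pos / count
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        total += f1 * count
    return total / len(y_true)


# Toy example: 1 = churned, 0 = stayed.
y_true = [0, 0, 0, 1]
y_pred = [0, 0, 1, 1]
print(accuracy(y_true, y_pred))     # 0.75
print(weighted_f1(y_true, y_pred))  # about 0.7667
```

Because churn datasets are usually imbalanced, a model can reach a high accuracy while the F1 score stays lower, which is one reason to report both.
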

Key Insights

The feature importance analysis suggests that:

Contract Type is the most significant predictor of churn; customers with month-to-month contracts are at the highest risk.
Tenure is crucial; new customers are far more likely to leave than long-term ones.
Fiber Optic internet users show a higher churn rate compared to DSL users.
