This project is a classic machine learning classification problem focused on detecting breast cancer based on diagnostic medical measurements. Using the Wisconsin Breast Cancer dataset, a K-Nearest Neighbors (KNN) model is trained and evaluated to classify tumors as either malignant (cancerous) or benign (non-cancerous).
The primary goal is to demonstrate a complete machine learning workflow, from data cleaning and exploratory analysis to model training, evaluation, and refinement using feature scaling.
The dataset used is the Wisconsin Breast Cancer dataset. It contains 569 instances and 30 numeric features, which are computed from a digitized image of a fine needle aspirate (FNA) of a breast mass.
- Source: The data file is included in the /data directory of this repository.
- Features: Include measurements like mean radius, mean texture, mean perimeter, mean area, etc.
- Target Variable: diagnosis (M = malignant, B = benign).
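As a quick sanity check on the figures above, the same dataset also ships with scikit-learn. The sketch below uses that bundled copy as a stand-in for the file in /data (an assumption; the notebook itself reads the repository's file). Note that scikit-learn encodes the target the other way round from this project: there, 0 is malignant and 1 is benign.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer

# Assumption: scikit-learn's bundled copy stands in for the repository's data file.
data = load_breast_cancer()
X, y = data.data, data.target

n_samples, n_features = X.shape  # 569 samples, 30 features

# target_names is ['malignant', 'benign'], so label 0 = malignant, 1 = benign.
counts = {str(name): int(n) for name, n in zip(data.target_names, np.bincount(y))}

print(n_samples, n_features)
print(counts)
```

The class counts show that the two classes are not balanced, which is one reason the results are read in terms of recall as well as accuracy.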
The project is structured in a single Jupyter Notebook (01_Data_Exploration.ipynb) and follows these key steps:
- Data Loading & Initial Inspection: The dataset is loaded using Pandas and a health check is performed with .info() and .head().
- Data Cleaning: An empty, irrelevant column (Unnamed: 32) is identified and removed. The categorical diagnosis column ('M'/'B') is encoded into numerical format (1/0).
- Exploratory Data Analysis (EDA): Visualizations like count plots, histograms, and correlation heatmaps are used to understand feature distributions and their relationships with the target variable.
- Data Preparation: The dataset is split into features (X) and target (y), and then further divided into training (80%) and testing (20%) sets.
- Model Training & Evaluation: A K-Nearest Neighbors (KNN) classifier is trained on the training data.
- Model Improvement: Feature scaling is applied using StandardScaler to normalize the feature ranges, and the model is re-trained and re-evaluated to demonstrate performance improvement.
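The steps above can be sketched end to end as follows. This is a minimal illustration, not the notebook's code: it uses scikit-learn's bundled copy of the dataset reshaped to resemble the raw file, and the helper names (make_raw_frame, clean, run_workflow), random_state=42, and n_neighbors=5 are all assumptions. KNN classifies by distance, so features with large numeric ranges (such as area) can dominate unscaled distances, which is why the scaler is fitted on the training split and then applied to both splits.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler


def make_raw_frame() -> pd.DataFrame:
    """Hypothetical stand-in for the raw file: 'M'/'B' labels plus an empty column."""
    bunch = load_breast_cancer(as_frame=True)
    df = bunch.data.copy()
    # scikit-learn encodes malignant as 0 and benign as 1.
    df["diagnosis"] = bunch.target.map({0: "M", 1: "B"})
    df["Unnamed: 32"] = np.nan
    return df


def clean(df: pd.DataFrame) -> pd.DataFrame:
    """Drop the empty column and encode diagnosis as 1 (malignant) / 0 (benign)."""
    df = df.drop(columns=["Unnamed: 32"], errors="ignore")
    return df.assign(diagnosis=df["diagnosis"].map({"M": 1, "B": 0}))


def run_workflow(random_state: int = 42, n_neighbors: int = 5):
    """Return (accuracy without scaling, accuracy with scaling) on the test split."""
    df = clean(make_raw_frame())
    X = df.drop(columns=["diagnosis"])
    y = df["diagnosis"]

    # 80% training, 20% testing.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=random_state
    )

    # Baseline: KNN on the raw feature ranges.
    knn = KNeighborsClassifier(n_neighbors=n_neighbors)
    knn.fit(X_train, y_train)
    acc_raw = accuracy_score(y_test, knn.predict(X_test))

    # Fit the scaler on the training split only, then transform both splits.
    scaler = StandardScaler()
    X_train_s = scaler.fit_transform(X_train)
    X_test_s = scaler.transform(X_test)

    knn_scaled = KNeighborsClassifier(n_neighbors=n_neighbors)
    knn_scaled.fit(X_train_s, y_train)
    acc_scaled = accuracy_score(y_test, knn_scaled.predict(X_test_s))

    return acc_raw, acc_scaled


acc_raw, acc_scaled = run_workflow()
print(f"accuracy without scaling: {acc_raw:.3f}")
print(f"accuracy with scaling:    {acc_scaled:.3f}")
```

The exact accuracies depend on the split and on the number of neighbors, so this sketch will not necessarily reproduce the notebook's figures.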
The model's performance was evaluated before and after feature scaling. Test accuracy rose by roughly 1.8 percentage points after scaling.
- Initial Model Accuracy: ~94.7%
- Final Model Accuracy (with Feature Scaling): 96.5%
Most importantly, the scaled model achieved a perfect recall of 1.00 for the malignant class, meaning it correctly identified all actual malignant tumors in the test set. This highlights the critical importance of preprocessing steps like feature scaling in machine learning.
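Recall for the malignant class is the fraction of actual malignant cases the model flags: true positives divided by true positives plus false negatives. A recall of 1.00 alongside an accuracy below 100% means the remaining errors are all benign tumors predicted as malignant. The sketch below illustrates this with a small set of hypothetical labels (not the project's test set), and the recall helper is an illustrative name rather than the notebook's code.

```python
def recall(y_true, y_pred, positive=1):
    """Recall for the `positive` class: TP / (TP + FN)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp + fn == 0:
        raise ValueError("y_true contains no samples of the positive class")
    return tp / (tp + fn)


# Hypothetical labels: 1 = malignant, 0 = benign.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 1, 0, 0, 0, 0]  # one benign case flagged as malignant

malignant_recall = recall(y_true, y_pred)  # 3 / 3 = 1.0
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)  # 7 / 8 = 0.875

print(malignant_recall, accuracy)
```

In a screening setting a missed malignant tumor is usually the costlier error, which is why recall on the malignant class is reported separately from accuracy.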
- Python
- Pandas: For data manipulation and analysis.
- NumPy: For numerical operations.
- Matplotlib & Seaborn: For data visualization.
- Scikit-learn: For model building, preprocessing, and evaluation.
- Jupyter Notebook: For interactive development.
- Clone this repository to your local machine.
- Create and activate a Python virtual environment:
    python -m venv venv
    source venv/bin/activate     # On macOS/Linux
    .\venv\Scripts\activate      # On Windows
- Install the required libraries:
    pip install pandas numpy matplotlib seaborn scikit-learn jupyter
- Open the Jupyter Notebook 01_Data_Exploration.ipynb and run the cells sequentially.