This project aims to analyze the Breast Cancer Wisconsin (Diagnostic) dataset and build machine learning models to predict whether a breast cancer diagnosis is malignant or benign based on the physical characteristics of the tumor.
The dataset contains information about breast cancer diagnoses, with 569 rows and 32 columns. Each row represents a patient, and the columns represent various physical characteristics of the tumors along with the diagnosis label (M = malignant, B = benign).
- UW CS FTP Server: ftp.cs.wisc.edu (path:
/math-prog/cpo-dataset/machine-learn/WDBC/) - UCI Machine Learning Repository: Breast Cancer Wisconsin (Diagnostic) Dataset
- ID number
- Diagnosis: (M = malignant, B = benign)
- Features (3-32): Ten real-valued features computed for each cell nucleus:
- a) Radius: Mean of distances from center to points on the perimeter
- b) Texture: Standard deviation of gray-scale values
- c) Perimeter
- d) Area
- e) Smoothness: Local variation in radius lengths
- f) Compactness: (perimeter^2 / area - 1.0)
- g) Concavity: Severity of concave portions of the contour
- h) Concave points: Number of concave portions of the contour
- i) Symmetry
- j) Fractal dimension: ("coastline approximation" - 1)
The mean, standard error, and "worst" or largest (mean of the three largest values) of these features were computed for each image, resulting in 30 features.
- Benign: 357
- Malignant: 212
- None
- Handling Missing Values: Dropped the empty
Unnamed: 32column. - Data Normalization: Normalized features to ensure a similar scale.
- Feature Selection: Selected features with high correlation to the target variable.
- Pandas: Data manipulation and cleaning.
- Scikit-learn: Data normalization and feature selection.
- Visualizations: Histograms, correlation heatmaps, and box plots to understand feature distributions and relationships.
- Key Findings: Significant differences between malignant and benign tumors in features such as
radius_mean,texture_mean,perimeter_mean,area_mean, andconcavity_mean.
- Algorithms Used: Logistic Regression, Support Vector Machine (SVM), Random Forest, K-Nearest Neighbors (KNN).
- Evaluation Metrics: Accuracy, precision, recall, and F1 score.
- Random Forest: Accuracy 96%, Precision 95%, Recall 96%, F1 Score 95%
- SVM: Accuracy 94%, Precision 93%, Recall 94%, F1 Score 93%
- Best Model: Random Forest with the highest overall performance.
- Key Features:
radius_mean,texture_mean, andperimeter_mean. - Implications: Potential for improving early breast cancer detection and diagnosis.
- Hyperparameter tuning, incorporating additional features or datasets, and exploring deep learning models.
- Summary: Machine learning models, especially Random Forest and SVM, can accurately predict breast cancer diagnoses based on tumor characteristics.
- Impact: These models can support medical professionals in early and accurate breast cancer diagnosis, enhancing clinical decision-making.
- UCI Machine Learning Repository: Breast Cancer Wisconsin (Diagnostic) Dataset
- [K. P. Bennett and O. L. Mangasarian: "Robust Linear Programming Discrimination of Two Linearly Inseparable Sets", Optimization Methods and Software 1, 1992, 23-34]