This project explores the Breast Cancer Wisconsin (Diagnostic) dataset through exploratory data analysis, dimensionality reduction with PCA, and supervised classification to predict breast cancer diagnosis.
The project contains three main Jupyter notebooks:
- Exploratory Data Analysis (EDA.ipynb)
  - Detailed exploratory analysis of dataset features.
  - Data visualization, correlation analysis, and feature distribution analysis.
- Unsupervised Learning: PCA (unsupervised_learning.ipynb)
  - Principal Component Analysis (PCA) applied to reduce dimensionality.
  - Analysis of variance explained by principal components.
  - Visualization of PCA components.
- Supervised Learning: Classification (supervised_learning.ipynb)
  - Logistic regression and other classification models for predicting diagnosis.
  - Performance evaluation using accuracy, precision, recall, and F1-score.
  - Optimization via hyperparameter tuning.
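
The PCA step listed for unsupervised_learning.ipynb can be sketched roughly as follows. This is a minimal illustration, not the notebook's actual code: it assumes scikit-learn's bundled copy of the dataset (`load_breast_cancer`) in place of the CSV the notebooks load, and the 95% variance threshold and the variable names are illustrative choices.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Assumption: scikit-learn's bundled copy stands in for the project's CSV file.
X, y = load_breast_cancer(return_X_y=True)

# Standardize first, because PCA is sensitive to feature scale.
X_scaled = StandardScaler().fit_transform(X)

# Fit PCA with all components to inspect the full variance profile.
pca = PCA().fit(X_scaled)
explained = pca.explained_variance_ratio_
cumulative = np.cumsum(explained)

# Smallest number of components retaining at least 95% of the variance
# (the 95% threshold is an illustrative choice).
n_components_95 = int(np.argmax(cumulative >= 0.95)) + 1

# Project onto the first two components, e.g. for a 2-D scatter plot.
X_2d = PCA(n_components=2).fit_transform(X_scaled)

print(f"Variance explained by PC1 and PC2: {explained[:2].sum():.3f}")
print(f"Components needed for 95% variance: {n_components_95}")
```

Plotting `X_2d` colored by `y` (for example with Matplotlib or Seaborn) gives a quick view of how well the two diagnosis classes separate in the reduced space.
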
The project uses the Breast Cancer Wisconsin (Diagnostic) dataset: 569 samples with 30 numeric features describing cell nuclei characteristics, computed from digitized images of fine needle aspirates of breast masses. Each sample is labeled as malignant or benign.
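
For a quick look at the data without downloading anything, the sketch below loads the copy of this dataset that ships with scikit-learn. That is an assumption made for illustration: the notebooks may read a Kaggle CSV instead, whose column names and label encoding can differ. The `diagnosis` column and the `correlations` variable are names introduced here.

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer

# Assumption: scikit-learn's bundled copy, not the project's own CSV file.
data = load_breast_cancer(as_frame=True)
df = data.frame  # 30 feature columns plus a numeric "target" column

# Map numeric targets to readable labels using the dataset's own names
# (in scikit-learn's encoding, 0 is malignant and 1 is benign).
label_names = {i: str(name) for i, name in enumerate(data.target_names)}
df["diagnosis"] = df["target"].map(label_names)

print(df.shape)
print(df["diagnosis"].value_counts())

# Features most strongly correlated (in absolute value) with the target.
correlations = (
    df.drop(columns=["diagnosis"])
    .corr()["target"]
    .drop("target")
    .abs()
    .sort_values(ascending=False)
)
print(correlations.head(5))
```
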
Tools and libraries:
- Data manipulation: Pandas, NumPy
- Visualization: Matplotlib, Seaborn
- Machine Learning: Scikit-learn, Statsmodels
- Interactive Environment: Jupyter Notebook
Project objectives:
- Understand and visualize breast cancer data.
- Perform feature selection and dimensionality reduction.
- Develop accurate predictive models for diagnosis classification.
- Clearly communicate results through visual and statistical summaries.
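
One way the modeling and reporting objectives might be approached is sketched below: a scaled logistic regression tuned with a small grid search and summarized with precision, recall, and F1-score. The split size, random seed, parameter grid, and macro-F1 scoring are illustrative assumptions rather than the settings used in supervised_learning.ipynb, and the data again comes from scikit-learn's bundled copy.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Assumption: scikit-learn's bundled copy stands in for the project's CSV file.
X, y = load_breast_cancer(return_X_y=True)

# Hold out a stratified test set; the split size and seed are illustrative.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Scaling inside the pipeline means each CV fold is scaled using only
# its own training portion, which avoids leaking validation statistics.
pipeline = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# Small illustrative grid over the regularization strength C.
param_grid = {"logisticregression__C": [0.01, 0.1, 1.0, 10.0]}
search = GridSearchCV(pipeline, param_grid, cv=5, scoring="f1_macro")
search.fit(X_train, y_train)

# Evaluate the refit best model on the held-out test set.
y_pred = search.predict(X_test)
report = classification_report(
    y_test, y_pred, target_names=["malignant", "benign"]
)
print("Best C:", search.best_params_["logisticregression__C"])
print(report)
```

In a diagnostic setting, recall on the malignant class is usually the number to watch most closely, since a missed malignant case is costlier than a false alarm.
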
Clone this repository:
git clone https://github.com/v4nui/breast-cancer-classification.git
cd breast-cancer-classification
Install the dependencies:
pip install -r requirements.txt
Launch Jupyter Notebook:
jupyter notebook
Resources:
- Breast Cancer Wisconsin (Diagnostic) Dataset on Kaggle
- Scikit-learn Documentation
- PCA Explained
- Logistic Regression In-depth Guide
- Ironhack learning materials
- StatQuest YouTube Channel
- OpenAI ChatGPT
For questions or feedback, please reach out to vanuhi@live.com.