This project is a cross-sectional study of health indicators associated with cardiovascular disease (CVD), based on a health screening dataset. The aim is to identify significant risk factors (e.g., age, gender, blood pressure, BMI, cholesterol, smoking, alcohol consumption, physical activity) and to develop machine learning models that predict the presence of CVD.
- Data Analysis: Explores relationships between health indicators and CVD using statistical methods (e.g., Chi-square test, Spearman's rank correlation).
- Data Cleaning: Handles outliers using measures of central tendency (median imputation), binning, and robust statistical methods (IQR-based filtering).
- Predictive Modeling: Implements machine learning models including Logistic Regression, Decision Tree, Random Forest, and Support Vector Machine (SVM).
- Visualization: Includes box plots, scatterplots, and heatmaps to visualize outliers and data distributions.
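As a rough illustration of the Spearman's rank correlation mentioned above, the sketch below applies `scipy.stats.spearmanr` to a small synthetic sample. The column names (`age`, `systolic_bp`) and all values are made up for illustration; they are not taken from the project's dataset or script.

```python
import numpy as np
import pandas as pd
from scipy.stats import spearmanr

# Synthetic stand-in data; values and column names are illustrative assumptions.
rng = np.random.default_rng(0)
age = rng.integers(30, 65, size=200)
# Build a blood-pressure column that tends to rise with age, plus noise.
systolic_bp = 90 + 0.8 * age + rng.normal(0, 8, size=200)
sample = pd.DataFrame({"age": age, "systolic_bp": systolic_bp})

# Spearman's rho measures monotonic association using ranks,
# so it is less sensitive to extreme values than Pearson's r.
rho, p_value = spearmanr(sample["age"], sample["systolic_bp"])
print(f"Spearman rho = {rho:.3f}, p-value = {p_value:.4f}")
```

A rank-based measure is a reasonable fit for screening data, where readings such as blood pressure may be skewed or contain extreme values.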
The project uses the Health Screening Data.csv dataset, which contains 69,960 entries and 18 columns, including:
- Numerical features: age, height, weight, systolic BP, diastolic BP, BMI, etc.
- Categorical features: gender, cholesterol, glucose, smoking, alcohol consumption, physical activity, CVD status.
- Derived features: BMI category, age group.
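Derived features such as these could be built with `pandas.cut`. The sketch below is only an assumption about how that might look: the column names, the units (centimetres, kilograms, years), the WHO-style BMI cut-offs, and the age-group edges are all illustrative and may differ from what the project actually uses.

```python
import pandas as pd

# Hypothetical sample; column names and units are assumptions.
people = pd.DataFrame({
    "age": [35, 52, 61],               # years
    "height": [170, 160, 180],         # centimetres
    "weight": [60.0, 80.0, 85.0],      # kilograms
})

# BMI = weight (kg) / height (m)^2
people["bmi"] = people["weight"] / (people["height"] / 100) ** 2

# Bin BMI with the usual WHO cut-offs; right=False makes each bin [low, high).
people["bmi_category"] = pd.cut(
    people["bmi"],
    bins=[0, 18.5, 25, 30, float("inf")],
    labels=["Underweight", "Normal", "Overweight", "Obese"],
    right=False,
)

# Bin age into groups; these edges are illustrative only.
people["age_group"] = pd.cut(
    people["age"],
    bins=[0, 40, 50, 60, 120],
    labels=["<40", "40-49", "50-59", "60+"],
    right=False,
)
print(people)
```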
The dataset has no missing values, and outliers were addressed using median imputation and IQR-based filtering.
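The two outlier strategies named here can be sketched as follows. This is a minimal example under stated assumptions, not the project's actual cleaning code: the helper `iqr_bounds`, the column name `systolic_bp`, the readings, and the conventional 1.5 multiplier are all hypothetical choices.

```python
import pandas as pd

def iqr_bounds(series, k=1.5):
    """Return the (lower, upper) Tukey fences for a numeric series."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Hypothetical systolic readings with two implausible values.
bp = pd.DataFrame(
    {"systolic_bp": [110.0, 120.0, 125.0, 130.0, 118.0, 122.0, 900.0, -100.0]}
)

lower, upper = iqr_bounds(bp["systolic_bp"])
is_outlier = (bp["systolic_bp"] < lower) | (bp["systolic_bp"] > upper)

# Option 1: IQR-based filtering drops the flagged rows.
filtered = bp[~is_outlier]

# Option 2: median imputation keeps the rows but replaces flagged values
# with the median of the non-outlying readings.
median_value = bp.loc[~is_outlier, "systolic_bp"].median()
imputed = bp.copy()
imputed.loc[is_outlier, "systolic_bp"] = median_value

print(filtered)
print(imputed)
```

Filtering shrinks the dataset, while imputation preserves the row count at the cost of flattening some variation; which is preferable depends on how many rows are affected.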
To run the project, ensure you have the following Python libraries installed:
pandas
numpy
matplotlib
seaborn
scikit-learn
scipy
Install dependencies using:
pip install pandas numpy matplotlib seaborn scikit-learn scipy
- Clone the repository:
  git clone https://github.com/username/repository.git
- Navigate to the project directory:
  cd repository
- Install the required dependencies:
  pip install -r requirements.txt
- Place the Health Screening Data.csv dataset in the project directory.
- Run the main analysis script (e.g., analysis.py):
  python analysis.py
- The script performs:
  - Data loading and cleaning.
  - Exploratory data analysis (EDA) with visualizations.
  - Statistical tests (e.g., Chi-square, Spearman’s correlation).
  - Training and evaluation of machine learning models (Logistic Regression, Decision Tree, Random Forest, SVM).
- Results include model performance metrics (accuracy, precision, recall, F1 score) and visualizations.
Example code snippet for model training:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
# Load and preprocess data
df = pd.read_csv("Health Screening Data.csv")
X = df.drop(['id', 'cardio'], axis=1)
y = df['cardio']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Scale features
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
# Train SVM model
model = SVC()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
# Evaluate
print("Accuracy:", accuracy_score(y_test, y_pred))
print("Precision:", precision_score(y_test, y_pred))
print("Recall:", recall_score(y_test, y_pred))
print("F1 Score:", f1_score(y_test, y_pred))
- Health Screening Data.csv: Dataset used for analysis.
- analysis.py: Main script for data cleaning, EDA, statistical tests, and model training.
- visualizations/: Directory containing generated plots (e.g., box plots, scatterplots, heatmaps).
- requirements.txt: List of required Python libraries.
- Data Import: Loaded the dataset using pandas.
- Data Cleaning: Renamed columns for clarity and handled outliers using median imputation and IQR-based filtering.
- Exploratory Data Analysis: Visualized data distributions and outliers using box plots and scatterplots.
- Statistical Analysis: Conducted Chi-square tests for categorical variables and Spearman’s correlation for continuous variables.
- Predictive Modeling: Trained and evaluated Logistic Regression, Decision Tree, Random Forest, and SVM models.
- Evaluation: Assessed models using accuracy, precision, recall, and F1 score.
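The modeling and evaluation steps above can be sketched for all four model families at once. The example below is an assumption-laden illustration rather than the project's script: it uses synthetic data from `make_classification` instead of the screening dataset, default or arbitrarily chosen hyperparameters, and scikit-learn pipelines so that scaling is fitted on the training split only.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary-classification data standing in for the screening dataset.
X, y = make_classification(n_samples=500, n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Scale-sensitive models get a StandardScaler inside a pipeline;
# tree-based models are trained on the unscaled features.
models = {
    "Logistic Regression": make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
    "Decision Tree": DecisionTreeClassifier(random_state=42),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=42),
    "SVM": make_pipeline(StandardScaler(), SVC()),
}

results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    results[name] = {
        "accuracy": accuracy_score(y_test, y_pred),
        "precision": precision_score(y_test, y_pred),
        "recall": recall_score(y_test, y_pred),
        "f1": f1_score(y_test, y_pred),
    }

for name, scores in results.items():
    print(name, {metric: round(value, 3) for metric, value in scores.items()})
```

Collecting the metrics in one dictionary makes it straightforward to compare models side by side; the scores printed here reflect the synthetic data only and say nothing about performance on the real dataset.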
- What are the significant health indicators associated with CVD?
- How do health indicators relate to the presence of CVD?
- Can predictive machine learning models accurately predict CVD based on these indicators?
- Null Hypothesis (H0): No significant relationship exists between health parameters and CVD risk.
- Alternative Hypothesis (H1): A significant relationship exists between health parameters and CVD risk.
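For a single categorical indicator, these hypotheses could be tested with a Chi-square test of independence. The sketch below uses `scipy.stats.chi2_contingency` on invented counts; the category labels, the counts, and the 0.05 significance level are assumptions for illustration, not results from the study.

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Hypothetical counts: rows are cholesterol levels, columns are CVD status.
observed = pd.DataFrame(
    {"no_cvd": [300, 120, 60], "cvd": [200, 130, 140]},
    index=["normal", "above_normal", "well_above_normal"],
)

# Returns the test statistic, p-value, degrees of freedom,
# and the expected counts under independence (H0).
chi2, p_value, dof, expected = chi2_contingency(observed)

alpha = 0.05  # assumed significance level
decision = "reject H0" if p_value < alpha else "fail to reject H0"
print(f"chi2 = {chi2:.2f}, dof = {dof}, p = {p_value:.4g} -> {decision}")
```

With row-level data, a contingency table of this shape could be built with `pandas.crosstab` before running the test.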
Contributions are welcome! To contribute:
- Fork the repository.
- Create a new branch (git checkout -b feature-branch).
- Commit your changes (git commit -m 'Add feature').
- Push to the branch (git push origin feature-branch).
- Open a Pull Request.
This project is licensed under the MIT License. See the LICENSE file for details.
For questions or feedback, contact the team:
- Deepishka Pemmasani
- Poorya Reddy Vanga
- Surya Tejaswi Mallidi
- Raaijtha Muthyala
- Daniel Adepoju
- Surja Tejaswi Mallidi
Affiliation: Department of Biohealth Informatics, IUPUI