predicting-lung-cancer

County-level lung cancer mortality prediction for the contiguous United States using socioeconomic, atmospheric, meteorological, livestock, and smoking predictors.

Overview

This project models county-level age-standardized lung cancer mortality from 2012 to 2019 with XGBoost and SHAP-based interpretation. The core question is whether atmospheric predictors remain important after county-level smoking rate is included explicitly in the model.

Main Result

The main 45-predictor model achieved:

test R^2 = 0.875
test RMSE = 5.40 deaths per 100,000
test MAE = 3.91 deaths per 100,000
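
For readers unfamiliar with these three metrics, the following is a minimal sketch of how they are conventionally computed from observed and predicted rates. The function name `regression_metrics` and the sample values are illustrative assumptions, not code or data from the notebooks.

```python
import math

def regression_metrics(y_true, y_pred):
    """Return (R^2, RMSE, MAE) for paired observed and predicted values."""
    n = len(y_true)
    if n == 0 or n != len(y_pred):
        raise ValueError("inputs must be non-empty and of equal length")
    mean_true = sum(y_true) / n
    residuals = [t - p for t, p in zip(y_true, y_pred)]
    ss_res = sum(r * r for r in residuals)               # residual sum of squares
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)   # total sum of squares
    if ss_tot == 0:
        raise ValueError("R^2 is undefined when all observed values are equal")
    r2 = 1.0 - ss_res / ss_tot
    rmse = math.sqrt(ss_res / n)
    mae = sum(abs(r) for r in residuals) / n
    return r2, rmse, mae

# Made-up illustrative rates (deaths per 100,000), not project data.
observed = [40.0, 50.0, 60.0, 70.0]
predicted = [42.0, 48.0, 63.0, 69.0]
r2, rmse, mae = regression_metrics(observed, predicted)
print(f"R^2={r2:.3f}  RMSE={rmse:.2f}  MAE={mae:.2f}")
```

RMSE and MAE are in the units of the target (deaths per 100,000), while R^2 is unitless.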

The leading predictors in the full model were:

bachelor's degree or higher (%)
smoking rate
FoT formaldehyde above the 75th percentile
wet-bulb temperature
sulphate aerosol mixing ratio

A comparison model using only smoking, socioeconomic, and demographic predictors performed worse:

test R^2 = 0.797
test RMSE = 6.88 deaths per 100,000
test MAE = 5.12 deaths per 100,000

The gap between the two models (test R^2 of 0.875 versus 0.797) shows that the atmospheric, meteorological, and livestock predictors add predictive information beyond what smoking, socioeconomic, and demographic predictors provide on their own.
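
As a rough way to read the two sets of numbers side by side, the arithmetic below expresses the difference in relative terms. It uses only the metrics reported above; the dictionary and variable names are illustrative.

```python
# Reported test metrics for the two models (taken from this README).
full = {"r2": 0.875, "rmse": 5.40, "mae": 3.91}
reduced = {"r2": 0.797, "rmse": 6.88, "mae": 5.12}

# Absolute gain in explained variance from the broader predictor set.
r2_gain = full["r2"] - reduced["r2"]

# How much larger the comparison model's errors are, relative to the full model.
rmse_increase_pct = 100 * (reduced["rmse"] - full["rmse"]) / full["rmse"]
mae_increase_pct = 100 * (reduced["mae"] - full["mae"]) / full["mae"]

# Share of the comparison model's unexplained variance that the full model removes.
unexplained_cut_pct = 100 * r2_gain / (1 - reduced["r2"])

print(f"R^2 gain: {r2_gain:.3f}")                                   # 0.078
print(f"RMSE increase: {rmse_increase_pct:.1f}%")                   # 27.4%
print(f"MAE increase: {mae_increase_pct:.1f}%")                     # 30.9%
print(f"Unexplained variance removed: {unexplained_cut_pct:.1f}%")  # 38.4%
```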

Data Sources

The project integrates five data sources:

IHME: county-level age-standardized lung cancer mortality
ACS 5-year estimates: socioeconomic and demographic predictors
CAMS EAC4 / ERA5: atmospheric and meteorological predictors
FAO Gridded Livestock of the World: livestock density predictors
County Health Rankings: adult smoking rate

Additional geographic input for the choropleth map comes from the 2019 U.S. Census TIGER county geometry.

Repository Layout

notebooks/: end-to-end data processing, modeling, supplementary figures, and the smoking-plus-SES comparison model
data/: raw inputs, processed tables, combined datasets, and modeling outputs
generate_lung_cancer_choropleth.py: reproducible script for the 2019 lung cancer mortality choropleth

Notebook Workflow

The main workflow is:

00_single_year_lung_cancer_mortality.ipynb
01_preprocessing_fips_lung_cancer.ipynb
02_fetch_merge_acs_variables.ipynb
02b_fetch_merge_smoking_data.ipynb
03_combine_features_by_year.ipynb
04_cleaning_dataset.ipynb
05_combine_all_datasets.ipynb
06_feature_analysis_demographics.ipynb
07_feature_analysis_weather.ipynb
08_create_final_reduced_dataset.ipynb
09_xgboost_bayesian_optimization.ipynb
10_additional_paper_figures.ipynb
11_xgboost_smoking_socioeconomic_only.ipynb

Notebook 09 produces the main model, SHAP rankings, permutation importance, and ablation outputs. Notebook 11 fits the smoking-plus-socioeconomic comparison model.
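
Permutation importance, one of the Notebook 09 outputs, can be illustrated with a small dependency-free sketch. This is not the notebook's code: the function names, the toy model, and the synthetic data are assumptions chosen to show the idea, which is to shuffle one feature at a time and measure how much the error grows.

```python
import math
import random

def rmse(y_true, y_pred):
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

def permutation_importance(predict, rows, y_true, n_repeats=20, seed=0):
    """Mean RMSE increase when each feature column is shuffled in turn."""
    rng = random.Random(seed)
    baseline = rmse(y_true, [predict(r) for r in rows])
    importances = []
    for j in range(len(rows[0])):
        increases = []
        for _ in range(n_repeats):
            column = [r[j] for r in rows]
            rng.shuffle(column)  # break the link between feature j and the target
            shuffled = [r[:j] + [v] + r[j + 1:] for r, v in zip(rows, column)]
            increases.append(rmse(y_true, [predict(r) for r in shuffled]) - baseline)
        importances.append(sum(increases) / n_repeats)
    return importances

# Toy stand-in for a fitted model: strong dependence on feature 0,
# weak dependence on feature 1, none on feature 2.
def toy_predict(row):
    return 3.0 * row[0] + 0.5 * row[1]

data_rng = random.Random(42)
rows = [[data_rng.random() for _ in range(3)] for _ in range(200)]
y = [toy_predict(r) for r in rows]

scores = permutation_importance(toy_predict, rows, y)
print([round(s, 3) for s in scores])
```

A feature the model ignores gets a score of zero, and features are ranked by how much shuffling them degrades the predictions. With a real fitted model, `predict` would wrap the model's prediction call on held-out rows.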

Key Outputs

Important output files include:

Reproducibility

To regenerate the lung cancer choropleth map:

conda run -n main_env python generate_lung_cancer_choropleth.py

This script reads:

data/raw/census_tiger_tl_2019_us_county.zip
data/processed/preprocessed_fips_lung_cancer/preprocessed_lung_cancer_fips_2019.csv

and writes:

paper/Figures/fig1_lung_cancer_mortality_map_2019.png
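
A map of this kind generally depends on county identifiers matching between the geometry and the mortality table, and on restricting the result to the contiguous United States. The sketch below shows one way to handle both. It is an assumption about what such a join needs, not a description of what `generate_lung_cancer_choropleth.py` does, and the helper names are hypothetical. The TIGER county file identifies counties by a 5-character `GEOID` string, so FIPS codes read from a CSV as integers or floats need zero-padding before a join.

```python
# State FIPS prefixes outside the contiguous United States:
# Alaska, Hawaii, American Samoa, Guam, Northern Mariana Islands,
# Puerto Rico, U.S. Virgin Islands.
NON_CONTIGUOUS_STATE_FIPS = {"02", "15", "60", "66", "69", "72", "78"}

def normalize_fips(value):
    """Return a 5-character, zero-padded county FIPS string."""
    text = str(value).strip()
    if text.endswith(".0"):  # value was parsed as a float, e.g. 1001.0
        text = text[:-2]
    if not text.isdigit() or len(text) > 5:
        raise ValueError(f"not a county FIPS code: {value!r}")
    return text.zfill(5)

def is_contiguous(fips):
    """True if the county's state prefix is one of the 48 states or DC."""
    return fips[:2] not in NON_CONTIGUOUS_STATE_FIPS

# Hypothetical raw values as they might appear after loading a CSV.
raw_codes = [1001, "6037", "02013", 15001.0, " 48201 "]
codes = [normalize_fips(c) for c in raw_codes]
contiguous = [c for c in codes if is_contiguous(c)]
print(contiguous)  # ['01001', '06037', '48201']
```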

Notes

Smoking is included explicitly as a predictor.
is_post_2015 is a measurement-control variable that accounts for the County Health Rankings methodology change; its importance in the model should not be interpreted as a scientific finding.
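
A control variable like this is typically a simple year indicator. The sketch below assumes the flag is 1 for years strictly after 2015 and 0 otherwise; the exact cutoff used in the notebooks is not stated here, so treat the comparison as an assumption.

```python
def is_post_2015(year, cutoff=2015):
    """Return 1 for years after the cutoff, else 0 (cutoff is an assumption)."""
    return int(year > cutoff)

# Study period covered by the project: 2012 through 2019.
flags = {year: is_post_2015(year) for year in range(2012, 2020)}
print(flags)
```

Including the flag lets the model absorb a level shift in the smoking measure that comes from how it was collected rather than from a change in smoking itself.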
