County-level lung cancer mortality prediction for the contiguous United States using socioeconomic, atmospheric, meteorological, livestock, and smoking predictors.
This project models county-level age-standardized lung cancer mortality from 2012 to 2019 with XGBoost and SHAP-based interpretation. The core question is whether atmospheric predictors remain important after county-level smoking rate is included explicitly in the model.
The main 45-predictor model achieved:
- test R^2 = 0.875
- test RMSE = 5.40 deaths per 100,000
- test MAE = 3.91 deaths per 100,000
The leading predictors in the full model were:
- bachelor's degree or higher (%)
- smoking rate
- FoT formaldehyde above the 75th percentile
- wet-bulb temperature
- sulphate aerosol mixing ratio
A comparison model using only smoking, socioeconomic, and demographic predictors performed worse:
- test R^2 = 0.797
- test RMSE = 6.88
- test MAE = 5.12
The full model's higher test R^2 (0.875 versus 0.797) and lower test errors indicate that the atmospheric, meteorological, and livestock predictors add county-level predictive information beyond smoking, socioeconomic, and demographic predictors.
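For reference, the three reported test metrics can be computed as in the following minimal sketch. The function name `regression_metrics` and the toy values are illustrative assumptions and are not taken from the notebooks, which may rely on library implementations instead.

```python
import math

def regression_metrics(y_true, y_pred):
    """Return (R^2, RMSE, MAE) for paired observed and predicted values."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(y_true)
    mean_true = sum(y_true) / n
    residuals = [t - p for t, p in zip(y_true, y_pred)]
    ss_res = sum(r * r for r in residuals)               # residual sum of squares
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)   # total sum of squares
    if ss_tot == 0:
        raise ValueError("R^2 is undefined when all observed values are equal")
    r2 = 1.0 - ss_res / ss_tot
    rmse = math.sqrt(ss_res / n)
    mae = sum(abs(r) for r in residuals) / n
    return r2, rmse, mae

# Toy values (hypothetical mortality rates per 100,000), for illustration only.
observed = [40.0, 50.0, 60.0, 70.0]
predicted = [42.0, 48.0, 63.0, 69.0]
r2, rmse, mae = regression_metrics(observed, predicted)
```

RMSE and MAE carry the units of the target (deaths per 100,000), while R^2 is unitless, so the error metrics of the two models are directly comparable.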
The project integrates five data sources:
- IHME: county-level age-standardized lung cancer mortality
- ACS 5-year estimates: socioeconomic and demographic predictors
- CAMS EAC4 / ERA5: atmospheric and meteorological predictors
- FAO Gridded Livestock of the World: livestock density predictors
- County Health Rankings: adult smoking rate
Additional geographic input for the choropleth map comes from the 2019 U.S. Census TIGER county geometry.
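Combining these sources requires a shared county key. The sketch below assumes each source is keyed by a five-digit county FIPS code and a year; the notebook names suggest FIPS-based preprocessing, but the helper names, column names, and values here are hypothetical.

```python
def normalize_fips(code):
    """Return a county FIPS code as a zero-padded five-character string."""
    text = str(code).strip()
    if not text.isdigit() or len(text) > 5:
        raise ValueError(f"not a county FIPS code: {code!r}")
    return text.zfill(5)

def inner_join(*tables):
    """Inner-join dict tables keyed by (fips, year); values are dicts of columns."""
    shared_keys = set(tables[0])
    for table in tables[1:]:
        shared_keys &= set(table)
    merged = {}
    for key in sorted(shared_keys):
        row = {}
        for table in tables:
            row.update(table[key])
        merged[key] = row
    return merged

# Hypothetical rows: FIPS codes stored as integers lose their leading zero.
mortality = {
    (normalize_fips(1001), 2019): {"mortality_rate": 55.2},
    (normalize_fips(6037), 2019): {"mortality_rate": 30.1},
}
smoking = {
    (normalize_fips("01001"), 2019): {"smoking_rate": 18.0},
}
combined = inner_join(mortality, smoking)
```

Normalizing the key before joining matters because a code such as 01001 read as a number becomes 1001 and would otherwise fail to match its string counterpart.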
- notebooks/: end-to-end data processing, modeling, supplementary figures, and the smoking-plus-SES comparison model
- data/: raw inputs, processed tables, combined datasets, and modeling outputs
- generate_lung_cancer_choropleth.py: reproducible script for the 2019 lung cancer mortality choropleth
The main workflow is:
- 00_single_year_lung_cancer_mortality.ipynb
- 01_preprocessing_fips_lung_cancer.ipynb
- 02_fetch_merge_acs_variables.ipynb
- 02b_fetch_merge_smoking_data.ipynb
- 03_combine_features_by_year.ipynb
- 04_cleaning_dataset.ipynb
- 05_combine_all_datasets.ipynb
- 06_feature_analysis_demographics.ipynb
- 07_feature_analysis_weather.ipynb
- 08_create_final_reduced_dataset.ipynb
- 09_xgboost_bayesian_optimization.ipynb
- 10_additional_paper_figures.ipynb
- 11_xgboost_smoking_socioeconomic_only.ipynb
Notebook 09 produces the main model, SHAP rankings, permutation importance, and ablation outputs. Notebook 11 fits the smoking-plus-socioeconomic comparison model.
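Permutation importance, one of the outputs listed for notebook 09, measures how much a fitted model's error increases when a single predictor column is shuffled. The sketch below illustrates the idea using only the standard library; the stand-in `toy_model`, the function names, and the data are hypothetical, and the notebook's own implementation against the fitted XGBoost model may differ.

```python
import math
import random

def rmse(y_true, y_pred):
    """Root mean squared error of paired values."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

def permutation_importance(predict, rows, y_true, n_repeats=20, seed=0):
    """Mean increase in RMSE after shuffling each column of rows in turn."""
    rng = random.Random(seed)
    baseline = rmse(y_true, [predict(row) for row in rows])
    importances = []
    for col in range(len(rows[0])):
        increases = []
        for _ in range(n_repeats):
            shuffled = [row[col] for row in rows]
            rng.shuffle(shuffled)
            # Rebuild each row with only the chosen column replaced.
            permuted = [row[:col] + [value] + row[col + 1:]
                        for row, value in zip(rows, shuffled)]
            permuted_error = rmse(y_true, [predict(row) for row in permuted])
            increases.append(permuted_error - baseline)
        importances.append(sum(increases) / n_repeats)
    return importances

def toy_model(row):
    # Hypothetical stand-in for a fitted model: uses column 0, ignores column 1.
    return 3.0 * row[0]

rows = [[float(i), float(i % 3)] for i in range(30)]
targets = [toy_model(row) for row in rows]
scores = permutation_importance(toy_model, rows, targets)
```

In this toy setup the ignored column receives an importance of zero, while shuffling the column the model depends on raises the error.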
Important output files include:
- data/outputs/modeling/xgboost/table1_metrics_all_features.csv
- data/outputs/modeling/xgboost/table5_ablation_comparison.csv
- data/outputs/modeling/xgboost_smoking_ses/table1_metrics_smoking_ses.csv
- paper/Figures/fig1_lung_cancer_mortality_map_2019.png
To regenerate the lung cancer choropleth map:

    conda run -n main_env python generate_lung_cancer_choropleth.py

This script reads:
- data/raw/census_tiger_tl_2019_us_county.zip
- data/processed/preprocessed_fips_lung_cancer/preprocessed_lung_cancer_fips_2019.csv
and writes:
- paper/Figures/fig1_lung_cancer_mortality_map_2019.png
- Smoking is included explicitly as a predictor.
- is_post_2015 is a measurement-control variable for the County Health Rankings methodology change and is not a scientific finding.