The steps I took to setup the environment and to run the analysis are described below.
1. Configure NAISS (National Academic Infrastructure for Supercomputing in Sweden) high performance cluster environment apptainers
I ran the following from the /mimer/NOBACKUP/groups/globalpoverty1/cindy/recipes/ directory. In GitHub, the scripts mentioned below are found at ./env_conf. If rerunning, adjust the commands for your directory structure.
- Move current image to a backup name, so the new image can be created: mv ../images/final_upg_causalimages.sif ../images/final_upg_causalimages.bck
- Build the new image with Connor's updated software apptainer build ../images/final_upg_causalimages.sif final_upg_causalimages.def &> final_upg_causalimages.log
- Review logfile created by previous step to ensure there were no build errors
- Edit the final.def file in the recipes directory to include the package in the line starting with R --slave -e 'install.packages(c("areal".... (keep them in alphabetical order to avoid duplicates)
- Move current image to a backup name, so the new image can be created: mv ../images/final.sif ../images/final.bck
- Build the new image: apptainer build ../images/final.sif final.def &> final.log
- Review logfile created by previous step to ensure there were no build errors
- Rebuild the final_upg_causalimages.sif using the steps in the first paragraph
- Move current image to a backup name, so the new image can be created: mv ../images/cindy_python.sif ../images/cindy_python.bck
- Build the new image with any python environment changes: apptainer build ../images/cindy_python.sif cindy_python.def &> cindy_python.log
- Review logfile created by previous step to ensure there were no build errors
-
To download daytime satellite images over all neighborhood units of analysis for the study period via a jupyter notebook:
./python/0_download_images_c1_5k_3yr.ipynb
This relies upon
- ./python/satellite_sampling_5k_3y.py
- ./python/gee_exporter_c1_5k.py
Note that I also downloaded images from collection 2, of different resolutions, and of different timeframes, so those scripts are also available in the same directory.
-
To download a csv file with per capita nightlight info over points used in this study
- ./python/0_download_percapita_nl_harmonized_5k_WorldPop.ipynb
-
Prepare basic data used by the rest of the analysis:
- ./R/prep_iso_codes.R
- ./R/prep_projects.R
- ./R/prep_dhs_points.R
-
Determine which neighborhoods are treatments or controls for each funder and sector
- ./R/select_dhs_treat_control_5k.R
- To compare the number of observations if I had used actual, cross-sectional DHS survey wealth measures, I used the script ./R/select_dhs_treat_control_actual.R
-
Prepare confounder data for each neighborhood
- ./R/prep_confounders_dhs_5k_raster.R
- ./R/prep_confounder_disasters.R
- ./R/prep_confounders_dhs_vector.R
- ./R/prep_confounders_dhs_natl_res.R
- ./R/prep_confounders_dhs_country.R
- ./R/prep_confounders_dhs_loan_projects.R
To consolidate all confounder data in a wide format, one row per dhs point
- ./R/consolidate_confounders_wide_5k_dhs.R
Note that descriptive statistics specific to a funder-sector are produced by each funder-sector analysis in step 5. The steps below are for all other figures and tables in the study.
- ./R/prep_desc_stats.R
- ./R/chart_projects.R
- ./R/chart_dhs_projs.R
- ./R/map_dhs_proj_countries.R
-
For the Convolutional Neural Network analysis on 3-year images
- ./scripts/run_cnn_3yr.sh is a bash shell script that submits a funder/sector analysis each minute so the funder_sector analyses can run in parallel
- by calling the SLURM script ./scripts/call_CI_Conf_5k_3yr.slurm
- which calls the R script ./R/call_CI_Conf_5k_3yr.R
-
For the Randomized Embedding analysis on 3-year images, the script is ./scripts/run_emb_3yr.sh and calls the same code as above, but with different parameters.
In addition to running the Image Confounding Analysis, this R script also
- Constructs the final panel data for each sector and funder
- Writes the tfrecord files to speed up the analysis
- Produces the maps, boxplots, scatterplots and other funder/sector specific figures
- Runs a Ridge Regression on the tabular-only covariates, produces a treatment propensity overlap chart, and estimates an Average Treatment Effect with 100 bootstrap samples and takes the standard deviation to estimate the standard error.
Note that the directory also has parallel scripts to run the analysis on annual images and to run with the vision transformer backbone, which did not make it into the thesis due to time and space constraints.
These runs create a subdirectory named after the run, where all the output files are placed. The output files for this thesis are in the following locations:
- ./results/cnn_3yr
- ./results/emb_3yr
-
To rename output files so they would sort appropriately in a consolidated output file I reviewed for each run, I ran the following bash shell script on the NAISS server:
- ./scripts/rename_output.sh
-
Use WinSCP to copy the files to a results directory on a laptop.
-
Use the following Windows batch scripts and Image Magick and PDFTK software to create a consolidated pdf results file for each funder. These scripts convert png files to pdf files and then consolidate the files into a single pdf for each funder
- .\scripts\combine_results_png_pdf.bat (run once for each vision backbone and funder)
Other utility scripts include
- Combine only treatment propensity charts using .\scripts\combine_results_treatprop.bat
- Combine both loss figures and treatment propensity charts using .\scripts\combine_loss_hist.bat
-
Consolidate results across runs and prepare cross-run ATE figures, Salience Charts, and AUC statistics
- ./R/consolidate_CI_output_across_runs_cnn_emb.R