This repository contains the code for reproducing simulation and real data analysis results of the paper "Causal Inference for Genomic Data with Multiple Heterogeneous Outcomes".
causarray: The main module for the proposed causal inference method. The package in the current repo is an early version for the purpose of reproducibility; please check out the latest version ofcausarraythat is under development.
- ex1: Simulation with Poisson DGP with sample splitting
generate_simu_data.py: Utility function that generates simulated data.run_simu.py: Run simulation.simu.ipynb: Plot the simulation results.
- ex2: LUHMES CRISPR data
LUHMES.ipynb: LUHMES data analysis.
- ex3: Lupus data
Lupus.ipynb: Lupus data analysis.
The following Python packages are required for the reproducibility workflow.
| Package | Version |
|---|---|
| anndata | 0.9.2 |
| cvxpy | 1.1.18 |
| h5py | 3.1.0 |
| joblib | 1.1.0 |
| jupyter | 1.0.0 |
| matplotlib | 3.4.3 |
| numpy | 1.22.0 |
| pandas | 1.3.3 |
| python | 3.8.12 |
| scanpy | 1.9.3 |
| scikit-learn | 1.1.2 |
| scipy | 1.10.1 |
| seaborn | 0.13.0 |
| statsmodels | 0.13.5 |
| tqdm | 4.62.3 |
For simulation studies, the workflow is as follows:
- Run script
run_simu.pyto perform simulation. The script takes two arguments: the simulation scenarios (0 for mean shifts and 1 for median shifts) and the number of folds (1 for no splitting and 5 for 5-fold cross-fitting). The results will be stored in the folderresult/simu/. - Use
simu.ipynbto reproduce the figures (Figures 1 and E1-E2) based on the previous results.
For real data analysis on LUHMES data, the workflow is as follows:
- (Optional) Preprocessing
- Store the raw data file as the folder
data/LUHMES/GSE142078_RAWfrom the original paper [1]. - Store the supplemental code as the folder
data/LUHMES/Supplemental_Codefrom the original paper [1]. - Run
data/LUHMES/preprocess_data.Rto preprocess the LUHMES data, and the resulting data will be stored asdata/LUHMES/LUHMES.h5, which is available in the current project folder.
- Store the raw data file as the folder
- Use
LUHMES.ipynbto run the proposed causal inference method and reproduce the figures (Figures 3-4 and G3-G7) and tables (Tables E1-E2).
For real data analysis on Lupus data, the workflow is as follows:
- Obtain the h5ad file of the lupus data from the authors of the original paper [2] and store it in the folder
data/lupus/GSE174188_CLUES1_adjusted.h5ad. - Run
ex4_preprocess_lupus.pyof the GitHub repo of paper [3] to preprocess the lupus data. - Use
Lupus.ipynbto run the proposed causal inference method and reproduce the figures (Figures 5-6, G3-G7, and E3).
- [1] Lalli, M. A., Avey, D., Dougherty, J. D., Milbrandt, J., & Mitra, R. D. (2020). "High-throughput single-cell functional elucidation of neurodevelopmental disease–associated genes reveals convergent mechanisms altering neuronal differentiation. Genome research, 30(9), 1317-1331.
- [2] Perez, Richard K., et al. "Single-cell RNA-seq reveals cell type–specific molecular and genetic associations to lupus." Science 376.6589 (2022): eabf1970.
- [3] Jin-Hong Du, Larry Wasserman, and Kathryn Roeder. "Simultaneous inference for generalized linear models with unmeasured confounders." arXiv preprint arXiv:2309.07261 (2023).