Mac/Linux only.
Rscript pprs.R "
# (Required arguments)
--geno_files <f1.bgen f2.bcf data/*.vcf>
--score_file <my_cluster_weights.csv>
--score_file_chr_col <colname>
--score_file_pos_col <colname>
--score_file_ref_col <colname>
--score_file_alt_col <colname>
--score_file_ea_col <colname>
--score_file_weight_cols <colname(s)>
# (Optional arguments)
--sample_file <my_bgen_samps.sample>
--proxy_r2_cutoff <between 0-1, proxy minimum R^2> (default 0.8)
--proxy_winsize_kb <distance to search for proxies> (default 100)
--proxy_sample_pop <population code like AFR>
--allow_allele_flips
--bcftools_exe <path/to/bcftools> (default: bcftools or auto-installation)
--bgenix_exe <path/to/bgenix> (default: bgenix or auto-installation)
--plink2_exe <path/to/plink2> (default: plink2 or auto-installation)
--scratch_folder (default: scratch/)
--output_fnm (default: my_results.txt)
--threads (default: 1)
--memory_mb (default: 8000)
"
- --geno_files: .vcf[.gz]/.bcf[.gz], .gds, .bgen, .bed+.bim+.fam, or .pgen+.pvar+.psam file formats.
  - Can specify (multiple) files, or patterns e.g. data/*.bgen.
  - VCF files can be accessed directly through URLs. See examples.
- --sample_file: an accompanying .sample file is required if using .bgen file(s).
- If there are variants with duplicate chr,pos,ref,alt in your genotype data, the first will be chosen.
- --score_file: Must contain columns for chromosome, position, reference allele, alternate allele, effect allele, and at least one column of weights.
- This pipeline automatically replaces score_file variants missing from your geno_files with suitable proxies from 1000 Genomes, based on the following parameters:
  - --proxy_r2_cutoff: The minimum R^2 correlation that proxies are allowed to have.
  - --proxy_winsize_kb: The larger the window size, the longer it will take to calculate proxies.
  - --proxy_sample_pop: See here the list of accepted codes.
- --allow_allele_flips: Allows variants in your score file to have ambiguous ref/alt alleles. Useful if you only know the effect and non-effect alleles.
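Since every column named on the command line has to exist in the score file, a quick header check before a long run can catch a misspelled column name. The sketch below writes a two-variant toy score file, confirms that the intended column names appear in its header, and then assembles a call with only the required arguments. The file names (example_weights.csv, data/chr22.vcf.gz), the column names, the variant rows, and the weights are all invented for illustration, and the comma delimiter is assumed from the .csv extension in the usage above. The pipeline is only run if Rscript and pprs.R are actually present; otherwise the command is printed.

```shell
#!/bin/sh
# Toy score file; column names, variants, and weights are invented for illustration.
score="example_weights.csv"
cat > "$score" <<'EOF'
CHR,POS,REF,ALT,EA,weight_cluster1
1,1000123,A,G,G,0.12
2,2000456,C,T,C,-0.05
EOF

# Check that every column we are about to name on the command line
# really appears in the header (comma delimiter assumed).
wanted="CHR POS REF ALT EA weight_cluster1"
header=$(head -n 1 "$score")
missing=""
for col in $wanted; do
  # Wrap in commas so that e.g. "EA" cannot match inside another column name.
  case ",$header," in
    *",$col,"*) ;;
    *) missing="$missing $col" ;;
  esac
done
[ -z "$missing" ] || echo "Missing columns:$missing" >&2

# Required arguments as a single quoted string, mirroring the usage above.
# data/chr22.vcf.gz is a hypothetical genotype file.
args="
--geno_files data/chr22.vcf.gz
--score_file $score
--score_file_chr_col CHR
--score_file_pos_col POS
--score_file_ref_col REF
--score_file_alt_col ALT
--score_file_ea_col EA
--score_file_weight_cols weight_cluster1
"

if [ -z "$missing" ] && command -v Rscript >/dev/null 2>&1 && [ -f pprs.R ]; then
  Rscript pprs.R "$args"
else
  # Nothing to run here, so just show the command that would be issued.
  printf 'Rscript pprs.R "%s"\n' "$args"
fi
```

Optional flags such as --proxy_r2_cutoff or --allow_allele_flips would be added as further lines inside the same quoted string.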
- R (>=4.1)
- Packages:
install.packages(c("data.table","LDlinkR","XML"))
(The parallel package is also used, but it ships with R and does not need to be installed.)
Other dependencies are installed automatically as needed, into the current directory. Your system is expected to have basic utilities such as curl and make for downloading and building the needed software.
If you plan to run this pipeline repeatedly on the cloud or on a compute cluster, consider using an environment with these additional dependencies pre-installed so you don't waste time installing them each run:
- plink2
- (If using .bgen files) bgenix
- (If using .vcf/.bcf files) bcftools
- (If using .gds files) SeqArray R package
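Before building such an environment, it may help to see which of these tools are already on the PATH. The sketch below only looks for the default executable names from the usage above (plink2, bgenix, bcftools) plus Rscript; it does not install anything and does not check versions. The helper name check_tool is made up for this example.

```shell
#!/bin/sh
# check_tool is a hypothetical helper: reports whether an executable is on the PATH.
check_tool() {
  if command -v "$1" >/dev/null 2>&1; then
    echo "$1: found at $(command -v "$1")"
  else
    echo "$1: not found on PATH"
    return 1
  fi
}

n_missing=0
for tool in Rscript plink2 bgenix bcftools; do
  # Count the tools that are absent instead of stopping at the first one.
  check_tool "$tool" || n_missing=$((n_missing + 1))
done
echo "$n_missing of 4 tools missing from PATH"
```

For .gds input, the SeqArray package can be checked from within R with requireNamespace("SeqArray").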
- LDlink: Machiela MJ, Chanock SJ. LDlink: a web-based application for exploring population-specific haplotype structure and linking correlated alleles of possible functional variants. Bioinformatics. 2015 Jul 2.
- PLINK2: Chang CC, Chow CC, Tellier LCAM, Vattikuti S, Purcell SM, Lee JJ (2015) Second-generation PLINK: rising to the challenge of larger and richer datasets. GigaScience, 4.
- BCFtools: Danecek P, Bonfield JK, et al. Twelve years of SAMtools and BCFtools. GigaScience (2021) 10(2):giab008.
- BGEN: Band G, Marchini J. BGEN: a binary file format for imputed genotype and haplotype data. bioRxiv 308296.
- SeqArray: Zheng X, Gogarten S, Lawrence M, Stilp A, Conomos M, Weir B, Laurie C, Levine D (2017). SeqArray – A storage-efficient high-performance data format for WGS variant calls. Bioinformatics, 33(15), 2251-2257.
- Small examples for all file types
- Tell Broad cluster users to use GCC-5.2 or use Bcftools if bcftools fails to build. Or give general advice to enterprise Linux users that they'll need gcc>=5.2.