manning-lab/pprs-pipeline


Mac/Linux only.

Rscript pprs.R "
  # (Required arguments)
  --geno_files <f1.bgen f2.bcf data/*.vcf>

  --score_file <my_cluster_weights.csv>

  --score_file_chr_col     <colname>
  --score_file_pos_col     <colname>
  --score_file_ref_col     <colname>
  --score_file_alt_col     <colname>
  --score_file_ea_col      <colname>
  --score_file_weight_cols <colname(s)>

  # (Optional arguments)
  --sample_file <my_bgen_samps.sample>

  --proxy_r2_cutoff  <between 0-1, proxy minimum R^2> (default 0.8)
  --proxy_winsize_kb <distance to search for proxies> (default 100)
  --proxy_sample_pop <population code like AFR>

  --allow_allele_flips

  --bcftools_exe <path/to/bcftools> (default: bcftools or auto-installation)
  --bgenix_exe   <path/to/bgenix>   (default:  bgenix  or auto-installation)
  --plink2_exe   <path/to/plink2>   (default:  plink2  or auto-installation)

  --scratch_folder (default: scratch/)

  --output_fnm (default: my_results.txt)

  --threads   (default:    1)
  --memory_mb (default: 8000)
"
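As a concrete illustration of the template above, the sketch below writes a tiny score file and assembles a matching call. The file names (`data/chr21.vcf.gz`, `example_weights.csv`, `example_results.txt`), the column names, the variant rows, and the `RUN_PPRS` switch are all made up for illustration; only the flags come from the usage text. By default the command is printed rather than executed.

```shell
#!/bin/sh
# Hypothetical score file: one row per variant, one weight column per cluster.
# The rows are placeholders, not real variants or weights.
cat > example_weights.csv <<'EOF'
CHR,POS,REF,ALT,EA,cluster1,cluster2
21,1000000,A,G,G,0.12,0.00
21,2000000,C,T,C,0.00,0.34
EOF

# All options go into one quoted string, as in the usage template.
# Only one weight column is passed here because the template does not
# show how multiple column names are separated.
PPRS_ARGS="
  --geno_files data/chr21.vcf.gz
  --score_file example_weights.csv
  --score_file_chr_col     CHR
  --score_file_pos_col     POS
  --score_file_ref_col     REF
  --score_file_alt_col     ALT
  --score_file_ea_col      EA
  --score_file_weight_cols cluster1
  --proxy_r2_cutoff 0.8
  --output_fnm example_results.txt
"

if [ "${RUN_PPRS:-0}" = "1" ]; then
  # Real run: requires Rscript, pprs.R, and the genotype file to exist.
  Rscript pprs.R "$PPRS_ARGS"
else
  # Dry run (default): show what would be executed.
  printf 'Rscript pprs.R "%s"\n' "$PPRS_ARGS"
fi
```

Setting `RUN_PPRS=1` in the environment switches the sketch from printing the command to running it.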
  • --geno_files: Accepts .vcf[.gz]/.bcf[.gz], .gds, .bgen, .bed+.bim+.fam, or .pgen+.pvar+.psam file formats.
    • Can specify one or more files, or patterns, e.g. data/*.bgen.
    • VCF files can be accessed directly through URLs. See examples.
    • --sample_file: an accompanying .sample file is required if using .bgen file(s).
    • If there are variants with duplicate chr,pos,ref,alt in your genotype data, the first will be chosen.
  • --score_file: Must contain columns for chromosome, position, reference allele, alternate allele, effect allele, and at least one column of weights.
  • This pipeline automatically replaces score_file variants missing from your geno_files with suitable proxies from 1000 Genomes, based on the following parameters:
    • --proxy_r2_cutoff: The minimum R^2 correlation that proxies are allowed to have.
    • --proxy_winsize_kb: The distance (in kb) to search for proxies. The larger the window size, the longer it will take to calculate proxies.
    • --proxy_sample_pop: A population code such as AFR. See here for the list of accepted codes.
  • --allow_allele_flips: Allows variants in your score file to have ambiguous ref/alt alleles. Useful if you only know the effect and non-effect alleles.
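The duplicate-variant rule noted under --geno_files (the first record wins when chr, pos, ref, alt repeat) can be previewed on a text VCF with a standard `awk` idiom. This is a sketch of the stated behaviour, not the pipeline's own code; the file name `dup_example.vcf` and its records are invented, and the records are truncated to five columns for brevity.

```shell
#!/bin/sh
# Tiny made-up VCF (tab-separated): CHROM POS ID REF ALT
printf '##fileformat=VCFv4.2\n'        >  dup_example.vcf
printf '#CHROM\tPOS\tID\tREF\tALT\n'   >> dup_example.vcf
printf '21\t1000000\tvarA\tA\tG\n'     >> dup_example.vcf
# Same chr,pos,ref,alt as varA: this one would be dropped.
printf '21\t1000000\tvarB\tA\tG\n'     >> dup_example.vcf
# Same position but a different alt allele: not a duplicate.
printf '21\t1000000\tvarC\tA\tT\n'     >> dup_example.vcf

# Skip header lines, then keep only the first record for each
# (CHROM, POS, REF, ALT) key, i.e. columns 1, 2, 4 and 5.
KEPT_IDS=$(awk -F '\t' '/^#/ { next } !seen[$1, $2, $4, $5]++ { print $3 }' dup_example.vcf)
echo "$KEPT_IDS"
# prints:
# varA
# varC
```

Running a check like this before scoring shows which record would be used when the order of duplicates matters to you.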

Dependencies

  • R (>=4.1)
    • Packages: install.packages(c("data.table","LDlinkR","parallel","XML"))

Other dependencies are installed automatically as needed, into the current directory. Your system is expected to have basic utilities such as curl and make for downloading and building the needed software. If you plan to run this pipeline repeatedly on the cloud or on a compute cluster, consider using an environment with these additional dependencies pre-installed so you don't waste time installing them on each run.
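Before a run, it can help to see which of the external tools named in the usage text are already on the `PATH`, since those should not need to be installed again. A minimal sketch follows; the helper name `check_tool` is ours and is not part of the pipeline.

```shell
#!/bin/sh
# Report whether a command is available on PATH.
check_tool() {
  if command -v "$1" >/dev/null 2>&1; then
    echo "$1: found"
  else
    echo "$1: missing"
  fi
}

# Tools the usage text lists as "default ... or auto-installation".
for tool in bcftools bgenix plink2; do
  check_tool "$tool"
done

# Basic utilities the pipeline expects to already be present.
for tool in curl make; do
  check_tool "$tool"
done
```

Any tool reported as found can also be passed explicitly through --bcftools_exe, --bgenix_exe, or --plink2_exe.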

TODO

  • Small examples for all file types
  • Tell Broad cluster users to use GCC-5.2 or use Bcftools if bcftools fails to build, or give general advice to enterprise Linux users that they need gcc >= 5.2.

About

Partitioned Polygenic Risk Scores pipeline: essentially a wrapper around `plink --score` which automatically finds proxies for missing variants.
