Full Genome Imputation

claussian edited this page Feb 9, 2014 · 8 revisions

Welcome to the MalariaImpute wiki!

IMPUTEv2 for Haploid Malaria Sequences

IMPUTE requires 5 inputs for an imputation run:

  1. Panel to be imputed
  2. Reference panel (no missingness allowed)
  3. Recombination rates file
  4. Chromosome marker file
  5. Samples file

IMPUTE was designed for array-based platforms, in which missingness would be systematic across all the samples. However, missingness in malaria sequences is sporadic. This forces us to execute an IMPUTE run for each sequence individually with its own custom reference panel (using all other sequences as references) and custom samples file.

Imputation Panels

Scripts beginning with haploid_converter_ take the set of sequences and spit out an IMPUTE-ready imputation panel for every sequence. One script takes care of each population.

Reference Panels

Scripts beginning with haploid_referencepaneller_ take the set of sequences and spit out an IMPUTE-ready reference panel for every sequence. For 590 sequences across five populations and 14 chromosomes, this gives us 590x14=8260 reference panels for the full genome (!). One script takes care of each population.
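
The per-sequence pairing described above (each sequence imputed against a panel built from all the others) can be sketched roughly as follows. This is only an illustration, not the actual haploid_converter_ or referencepaneller scripts: the function name, the toy isolate IDs, and the '?' missing-call symbol are all made up here, and the sketch does not address how missing calls in the other sequences are handled when they serve as references.

```python
from typing import Dict, List, Tuple


def leave_one_out_panels(
    sequences: Dict[str, str],
) -> Dict[str, Tuple[str, List[str]]]:
    """Pair each isolate's sequence with the IDs of all other isolates.

    The first tuple element plays the role of the imputation panel,
    the second lists the isolates that would form its reference panel.
    """
    panels = {}
    for target_id, target_seq in sequences.items():
        reference_ids = [sid for sid in sequences if sid != target_id]
        panels[target_id] = (target_seq, reference_ids)
    return panels


# Toy data: three hypothetical isolates; '?' is an assumed missing-call symbol.
toy = {"iso1": "01?1", "iso2": "0?01", "iso3": "1101"}
panels = leave_one_out_panels(toy)
for target_id, (seq, refs) in panels.items():
    print(target_id, seq, refs)
```

With 590 isolates and 14 chromosomes, repeating this pairing once per chromosome is what leads to the 8260 panels counted above.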

Samples file

The script haploid_samples_all_fullscan.R spits out an IMPUTE-ready samples (isolate ID) file for every sequence. Samples files are common across chromosomes.

Recombination rate and marker files

These have been pre-formatted on my desktop. One rates file and one marker file for each chromosome.

IMPUTE run

The script imputeloopN_fullscan_writer.R spits out 14 bash scripts for 14 chromosomes, each one executing IMPUTE runs for 590 imputation panels across five populations (590x14=8260 jobs).
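
As a rough sketch of what a writer of this kind does (the real one is in R; this is Python for illustration), the snippet below builds one bash script per chromosome with one command line per imputation panel. The function name, the isolate IDs, and in particular the command template are placeholders invented here; the template is an echo line, not a real IMPUTE invocation.

```python
from typing import Dict, List


def build_chromosome_scripts(
    isolate_ids: List[str],
    n_chromosomes: int,
    command_template: str,
) -> Dict[int, str]:
    """Return {chromosome number: bash script text}, one job line per isolate."""
    scripts = {}
    for chrom in range(1, n_chromosomes + 1):
        lines = ["#!/bin/bash"]
        for isolate in isolate_ids:
            lines.append(command_template.format(chrom=chrom, isolate=isolate))
        scripts[chrom] = "\n".join(lines) + "\n"
    return scripts


# Placeholder command (assumption): substitute the actual IMPUTE call here.
template = "echo run_impute chr{chrom} {isolate}"
isolates = [f"isolate{i:03d}" for i in range(1, 591)]  # 590 hypothetical IDs

scripts = build_chromosome_scripts(isolates, 14, template)
# Every line except the shebang is one job.
total_jobs = sum(len(text.splitlines()) - 1 for text in scripts.values())
print(len(scripts), total_jobs)  # 14 8260
```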

Beagle for Haploid Malaria Sequences

Unlike IMPUTE, the structure of Beagle is very amenable to imputing sequences with sporadic missingness. Only 2 inputs are required:

  1. Sequences file, containing all sequences to be imputed
  2. Markers file for the chromosome

Sequences File

The script beagleformatter_fullscan.R spits out a formatted sequence file for each population. Beagle is run separately for each population, since we have determined that cosmopolitan referencing (pooling all populations into one run) performs terribly in Beagle. At this stage, however, the sequence file is not Beagle-ready yet.

Diploid Conversion

Beagle only accepts diploid genotypes for imputation. The script diploid_converter_1s_fullscan.py converts the haploid sequences into homozygous diploid genotypes, after which the sequence file is Beagle-ready.
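
The idea behind the conversion is simply to write each haploid allele twice. A minimal sketch, assuming alleles are held as a list of single-character strings and that a missing call (shown here with a hypothetical '?' symbol) is duplicated like any other allele; this is not the code in diploid_converter_1s_fullscan.py, whose file format is not described here.

```python
from typing import List, Tuple


def haploid_to_homozygous_diploid(alleles: List[str]) -> List[Tuple[str, str]]:
    """Duplicate every haploid allele so each marker becomes homozygous."""
    return [(allele, allele) for allele in alleles]


# '?' is an assumed missing-call symbol; it is duplicated like any allele.
haploid = ["A", "T", "?", "G"]
diploid = haploid_to_homozygous_diploid(haploid)
print(diploid)
```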

Markers file

These have been pre-formatted on my desktop. One marker file for each chromosome.

Beagle run

The script beagleloop_fullscan.sh executes a Beagle job for each chromosome in each population (14x5=70 jobs).

Reappending imputed genotypes

The imputed genotypes are written to files with three columns per genotype, corresponding to the homozygous major, heterozygous, and homozygous minor probabilities. For each genotype, we add the homozygous major probability to 0.5 x the heterozygous probability, and append the results together into a single file. The script mask_reappender_gprobs_fullscan_writer.R spits out 14 R scripts, which do this for the 14 chromosomes.
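
The collapsing step can be sketched as below. It assumes one row of probabilities has already been read into a flat list of floats, three per genotype in the order homozygous major, heterozygous, homozygous minor; the function names are made up for this illustration, and file reading and the reappending itself are left out.

```python
from typing import List, Sequence


def major_allele_probability(gprobs: Sequence[float]) -> float:
    """P(hom. major) + 0.5 * P(het) for one genotype's three probabilities."""
    p_hom_major, p_het, _p_hom_minor = gprobs
    return p_hom_major + 0.5 * p_het


def collapse_row(row: Sequence[float]) -> List[float]:
    """Collapse a flat row of probability triplets to one value per genotype."""
    if len(row) % 3 != 0:
        raise ValueError("expected three probabilities per genotype")
    return [
        major_allele_probability(row[i:i + 3])
        for i in range(0, len(row), 3)
    ]


# Two hypothetical genotypes: (0.9, 0.1, 0.0) and (0.2, 0.2, 0.6).
row = [0.9, 0.1, 0.0, 0.2, 0.2, 0.6]
collapsed = collapse_row(row)
print(collapsed)
```

Since the diploid input was homozygous by construction, the heterozygous column has no haploid counterpart; splitting it evenly between the two alleles is what the 0.5 weight does.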