This is the methods section for a GWAS study using 300 Kazakh genomes.
I) Information about sequencing
- 296 samples were genotyped at 766,221 SNPs
- Platform: GSA MG v2
- Scan protocol: Standard Illumina procedures using Illumina iScan scanner
- Reference genome: GRCh38 (Genome Reference Consortium Human Build 38)
II) Raw data processing
Normalized signal intensities and genotypes were computed using Illumina’s GenomeStudio v.2 software.
- Clustering
- Make genotype calls across all samples using a standard Infinium BeadChip cluster file.
- The standard cluster file (*.egt file) supplied by Illumina for each Infinium BeadChip type is generated in an Illumina laboratory from a diverse set of more than 200 HapMap DNA samples. Some SNP probes (including custom probes) were clustered manually to reduce the number of spurious region calls and to increase the accuracy of the results (see the custom cluster tech note).
- Call rate check
- Quality was checked by plotting p10 GC and sample call rate.
- If a sample passes the intact DNA sample QC criteria but its call rate is significantly lower than that of the other samples, the sample can be re-run in some cases. Call rates were consistently high in this experiment; no samples were excluded at this stage.
- Call rate, p10 GC, and GenCall Score are available in phenotypes.tsv
- Call Rate: Proportion of SNPs (expressed as a decimal) whose GenCall score is greater than the specified threshold.
- p10 GC: 10th percentile GenCall score over all SNPs for this sample.
- GenCall Score: quality metric that indicates the reliability of each genotype call.
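The two per-sample metrics defined above can be illustrated with a minimal Python sketch. It is not GenomeStudio's implementation: the function names are hypothetical, the default threshold of 0.15 is an assumed value (the source does not state the threshold used), and the percentile uses linear interpolation, which may differ from GenomeStudio's exact method.

```python
from typing import Sequence


def call_rate(gencall_scores: Sequence[float], threshold: float = 0.15) -> float:
    """Proportion of SNPs whose GenCall score is greater than the threshold.

    The default threshold is an assumption for illustration only.
    """
    if not gencall_scores:
        raise ValueError("no GenCall scores supplied")
    passed = sum(1 for score in gencall_scores if score > threshold)
    return passed / len(gencall_scores)


def p10_gc(gencall_scores: Sequence[float]) -> float:
    """10th percentile of a sample's GenCall scores (linear interpolation)."""
    if not gencall_scores:
        raise ValueError("no GenCall scores supplied")
    ordered = sorted(gencall_scores)
    position = 0.10 * (len(ordered) - 1)   # fractional index of the percentile
    lower = int(position)
    upper = min(lower + 1, len(ordered) - 1)
    fraction = position - lower
    return ordered[lower] + fraction * (ordered[upper] - ordered[lower])
```

Both functions take the GenCall scores of a single sample, so they would be applied once per sample to produce the values that are then plotted for the call rate check.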
- Genotype matrix export
- Make a text file that contains the genotypes of all samples and probes.
- Make input files for third-party tools: *.ped and *.map files for running PLINK.
- HB00001157.ped and HB00001157.map were made available by the sequencing provider
- The phenotypes.tsv file containing patient information was provided separately
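For readers unfamiliar with the PLINK text formats, the sketch below shows how one line of a .map file and one line of a .ped file are laid out. It assumes the standard four-column .map layout (chromosome, SNP ID, genetic distance, base-pair position) and the standard .ped layout (six leading columns followed by two alleles per SNP). The function names are hypothetical and the parser is illustrative, not part of the project's pipeline.

```python
from typing import Dict, List, Tuple


def parse_map_line(line: str) -> Dict[str, object]:
    """Parse one line of a four-column PLINK .map file."""
    chrom, snp_id, genetic_distance, position = line.split()
    return {
        "chrom": chrom,
        "snp_id": snp_id,
        "cm": float(genetic_distance),
        "bp": int(position),
    }


def parse_ped_line(line: str) -> Tuple[Dict[str, str], List[Tuple[str, str]]]:
    """Parse one line of a PLINK .ped file into sample info and genotypes."""
    fields = line.split()
    if len(fields) < 6 or (len(fields) - 6) % 2 != 0:
        raise ValueError("malformed .ped line")
    keys = ("fid", "iid", "father", "mother", "sex", "phenotype")
    sample = dict(zip(keys, fields[:6]))
    alleles = fields[6:]
    # Alleles come in pairs, one pair per SNP, in .map order; "0" means missing.
    genotypes = [(alleles[i], alleles[i + 1]) for i in range(0, len(alleles), 2)]
    return sample, genotypes
```

The number of genotype pairs in each .ped line should equal the number of lines in the matching .map file, which makes a simple consistency check before running PLINK.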
III) Data processing in PLINK
- The prepared .map file was used as-is for further analysis
- The prepared .ped file needed further processing to include phenotypes (analysis shown in plink_qc1.md)
- The prepared .ped and .map files were then processed for quality control (analysis shown in plink_qc2.md)
- PCA, ROH, Fst and ADMIXTURE analyses were performed together with other populations from the Human Genome Diversity Project (HGDP) (analysis shown in plink_HGDP.md)
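The step of adding phenotypes to the .ped file can be sketched as follows. This is only an illustration of the idea, not the procedure recorded in plink_qc1.md: the column names `sample_id` and `status` are hypothetical (the actual header of phenotypes.tsv is not given here), and the phenotype values are assumed to be already coded in PLINK's convention (1 = unaffected, 2 = affected, -9 = missing).

```python
import csv
import io
from typing import Dict, Iterable, Iterator


def load_phenotypes(tsv_text: str,
                    id_column: str = "sample_id",
                    pheno_column: str = "status") -> Dict[str, str]:
    """Map sample ID to phenotype code from tab-separated text.

    The default column names are assumptions for illustration.
    """
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    return {row[id_column]: row[pheno_column] for row in reader}


def set_ped_phenotypes(ped_lines: Iterable[str],
                       phenotypes: Dict[str, str],
                       missing: str = "-9") -> Iterator[str]:
    """Overwrite the sixth .ped column (phenotype) using the individual ID."""
    for line in ped_lines:
        fields = line.split()
        if len(fields) < 6:
            raise ValueError("malformed .ped line")
        # Column 2 is the individual ID; unknown samples get the missing code.
        fields[5] = phenotypes.get(fields[1], missing)
        yield " ".join(fields)
```

An alternative that leaves the .ped file untouched is to pass a separate phenotype file to PLINK with its `--pheno` option.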