Mapping genome sequence data to a reference genome on Eddie using the BWA software package
Mapping the sequenced genome of the study to the reference genome provides insight into structural variants, which are critical for understanding the evolutionary process. The reference genome supplies the chromosomes and positions from which the sequenced reads originated, and mapping is the process of matching each read to its location on a specific chromosome. This shows which region and gene a read belongs to and reveals where repetitive regions occur. Aligning the sequence data to the reference genome is also a prerequisite for variant calling with tools such as samtools and GATK, which is vital for estimating the demographic model in this study. Several tools are available for mapping, and many studies have used bwa to index reference genomes and map sequence data to them. The present study uses the Burrows-Wheeler Aligner (BWA) to index the reference genome of Atlantic salmon and to map the sequence data to it, with the aim of estimating the effective population size of Atlantic salmon. The BWA workflow used here consists of two steps: bwa index and bwa mem.
The sequence data used for this study are paired-end reads from an Atlantic salmon from North America: https://www.ebi.ac.uk/ena/browser/view/SRR28213514. These reads were mapped to a reference genome from this source: https://www.ebi.ac.uk/ena/browser/view/GCA_905237065.2
Indexing makes it practical to align the sequence data to a large reference genome. Without an index, the aligner would have to search through the whole genome every time it aligned a read, so indexing reduces the alignment time considerably. The bwa index run generates files with the extensions .amb, .ann, .bwt, .pac, and .sa, which are required for efficient alignment. The process converts the FASTA file into Burrows-Wheeler Transform (BWT) related files, so the genome sequence is stored in a compressed format that optimizes searching.
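The indexing step can be sketched in Python as follows. This is a minimal illustration and not the exact script used in the study: the file name reference.fa and the helper function names are hypothetical, and the code assumes that bwa is available on the PATH (for example, after loading the appropriate module on Eddie). It builds the bwa index command and lists the five index files named above, so a run can be skipped when they already exist.

```python
import shutil
import subprocess
from pathlib import Path

# Extensions of the files that bwa index writes next to the reference FASTA.
INDEX_EXTENSIONS = (".amb", ".ann", ".bwt", ".pac", ".sa")


def index_command(reference: str) -> list[str]:
    """Build the bwa index command line for a reference FASTA."""
    return ["bwa", "index", reference]


def expected_index_files(reference: str) -> list[str]:
    """List the index files that bwa index should create for the reference."""
    return [reference + ext for ext in INDEX_EXTENSIONS]


def missing_index_files(reference: str) -> list[str]:
    """Return the expected index files that are not yet on disk."""
    return [f for f in expected_index_files(reference) if not Path(f).exists()]


def build_index(reference: str) -> None:
    """Run bwa index, but only if at least one index file is missing."""
    if shutil.which("bwa") is None:
        raise RuntimeError("bwa was not found on PATH")
    if missing_index_files(reference):
        subprocess.run(index_command(reference), check=True)


if __name__ == "__main__":
    # Hypothetical file name; substitute the downloaded reference FASTA.
    print(" ".join(index_command("reference.fa")))
    print(expected_index_files("reference.fa"))
```

Calling build_index("reference.fa") would then run the indexing only when needed, which matters because indexing a large genome can take a long time.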
The bwa mem run is given the path of the original reference genome FASTA file. This path acts as the prefix from which bwa mem locates the associated index files in the same directory, and the alignment itself is carried out against those index files. The mapping step is essential to a population genetic study because it places each read of the sequence data in its genomic context. The output is a file in Sequence Alignment/Map (SAM) format, which is very large.
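The mapping step can be sketched in the same way. This is an illustrative sketch under stated assumptions, not the study's actual command: the file names (reference.fa, SRR28213514_1.fastq.gz, SRR28213514_2.fastq.gz, SRR28213514.sam), the thread count, and the helper function names are all hypothetical. It relies on bwa mem writing SAM records to standard output, which is redirected to a file here.

```python
import subprocess


def mem_command(reference: str, reads_1: str, reads_2: str,
                threads: int = 4) -> list[str]:
    """Build the bwa mem command line for one paired-end sample."""
    # -t sets the number of threads; the reference path is the index prefix.
    return ["bwa", "mem", "-t", str(threads), reference, reads_1, reads_2]


def run_mem(reference: str, reads_1: str, reads_2: str,
            sam_path: str, threads: int = 4) -> None:
    """Run bwa mem and write its standard output (SAM) to sam_path."""
    with open(sam_path, "w") as sam:
        subprocess.run(
            mem_command(reference, reads_1, reads_2, threads),
            stdout=sam,
            check=True,
        )


if __name__ == "__main__":
    # Hypothetical file names; substitute the real reference and read files.
    print(" ".join(mem_command(
        "reference.fa",
        "SRR28213514_1.fastq.gz",
        "SRR28213514_2.fastq.gz",
    )))
```

A call such as run_mem("reference.fa", "SRR28213514_1.fastq.gz", "SRR28213514_2.fastq.gz", "SRR28213514.sam") would produce the SAM file, provided the index files already sit next to reference.fa.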