GRASPER: Genome Rearrangement Analysis using Short Paired-End Reads
GRASPER
Heewook Lee
heewlee@indiana.edu
--------------------------
SUMMARY
--------------------------
GRASPER (Genome Rearrangement Analysis using Short Paired-End Reads) is de novo structural variation (SV) calling software that is capable of detecting repetitive SVs.
It uses RepGraph (the BLAST to A-Bruijn program) to construct A-Bruijn graphs of a given reference genome to capture approximate repeats (e.g. 95% sequence similarity or higher); SVs are then detected on the graphs.
GRASPER requires a reference genome sequence in a FASTA-formatted file along with Illumina paired-end sequencing data of a sample genome.
Currently, it supports the following SV events:
1) Duplicative transposition
2) Deletion of non-repetitive region
3) Deletion of repetitive region
4) Deletion of non-repetitive region bounded by repeats (via homologous recombination)
5) Inversion
6) Tandem-duplication
Unsupported events are still reported in the form of breakpoints. GRASPER first calls breakpoints, then assigns SV events based on the well-known paired SV signatures along with read-depth information. Any breakpoint event without an SV event assignment is reported separately.
--------------------------
Requirements
--------------------------
To build and run GRASPER, the following are required:
- JDK 1.6 or higher
- Unix-like OS (Linux, Mac OS X, ... )
- Legacy BLAST (available from ftp://ftp.ncbi.nlm.nih.gov/blast/executables/release/LATEST/ ; more information at https://blast.ncbi.nlm.nih.gov/Blast.cgi?PAGE_TYPE=BlastDocs&DOC_TYPE=Download ). We used version 2.2.25, which can be downloaded from ftp://ftp.ncbi.nlm.nih.gov/blast/executables/release/2.2.25/
- Burrows-Wheeler Aligner by Heng Li (version 0.7.9 or higher)
- BLAST to A-Bruijn graph package (available from https://github.com/COL-IU/RepGraph )
- Illumina or Illumina-like paired-end reads (whole-genome sequencing)
- a reference genome sequence
- a .medMAD file. As of v0.1.1, this file is generated AUTOMATICALLY by RepGraph (v0.1.1). It contains the median and Median Absolute Deviation (MAD) values for the library insert size, as a single line of 2 values delimited by a tab. Under a normal distribution, 1 SD ~ 1.4826 MAD (https://en.wikipedia.org/wiki/Median_absolute_deviation).
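To illustrate the MAD-to-SD relation above, here is a minimal Python sketch that parses one .medMAD-style line and converts the MAD to an approximate standard deviation. It is not part of GRASPER or RepGraph. The function names are hypothetical, the order of the two values (median first, then MAD) is an assumption, and the numbers in the demo line are made up.

```python
def parse_medmad_line(line):
    """Parse a single line holding two tab-delimited numbers.

    ASSUMPTION: the first value is the median insert size and the second
    is the MAD. The README only says the file holds two tab-delimited
    values on one line.
    """
    fields = line.strip().split("\t")
    if len(fields) != 2:
        raise ValueError("expected exactly 2 tab-delimited values")
    median, mad = (float(value) for value in fields)
    return median, mad


def mad_to_sd(mad):
    """Approximate the standard deviation from the MAD (normal data)."""
    return 1.4826 * mad


def read_medmad(path):
    """Read the first line of a .medMAD file (hypothetical helper)."""
    with open(path) as handle:
        return parse_medmad_line(handle.readline())


if __name__ == "__main__":
    # Made-up values, for illustration only.
    median, mad = parse_medmad_line("470\t12\n")
    print("median:", median, "MAD:", mad, "approx. SD:", mad_to_sd(mad))
```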
-------------------------
Installation
-------------------------
After downloading the GRASPER source distribution and unpacking it, change into the top-level directory:
> cd grasper
Then, compile and create .jar files
> make
This will create a new directory "bin" under the grasper directory with the following jar file:
grasper.jar
-------------------------
Config file
-------------------------
The configuration file contains the parameters that GRASPER, RepGraph, BLAST and bwa need.
An example configuration file can be found in the "test_data" directory.
-------------------------
How to run
-------------------------
Although GRASPER can run as a standalone program, it first needs an A-Bruijn graph representation of the reference genome, which is generated by the RepGraph package, as well as a SAM-formatted alignment of the paired-end reads. For this reason, grasper.sh is provided to tie all these dependencies together in a single script.
Here is the list of commands for running on test_data:
1. Move into the test_data directory under the GRASPER directory
> cd <GRASPER_INSTALLATION_DIR>/test_data
2. Indexing for BLAST and bwa (ONLY needs to be run once for a reference genome)
> ../grasper.sh I example_config.txt
3. Run pair-wise BLASTN on a given reference genome and construct A-Bruijn graphs (ONLY needs to be run once for a reference genome)
> ../grasper.sh G example_config.txt
4. Align via BWA
> ../grasper.sh A example_config.txt 20Insertions_per_element_1TH_pIRS_20X_11_90_470_1.fq.gz 20Insertions_per_element_1TH_pIRS_20X_11_90_470_2.fq.gz
5. Depth serialization, midpoint sorting, discordant pair removal, SV detection
> ../grasper.sh DS example_config.txt
Note that the commands A, D and S can be run separately or combined (as in DS above, or ADS). Run grasper.sh without any parameters to see more explanation.
> ./grasper.sh
A screen dump of a run on test_data can be found in test_data/test_data.screendump
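The steps listed above could also be driven from a small wrapper. The Python sketch below only assembles the four grasper.sh invocations shown in this section and prints them; nothing is executed unless dry_run is set to False. The function names are hypothetical, and the script path assumes the commands are issued from inside test_data.

```python
import subprocess


def build_steps(script, config, reads_1, reads_2):
    """Return the grasper.sh invocations from this section, in order."""
    return [
        [script, "I", config],                    # indexing for BLAST and bwa
        [script, "G", config],                    # BLASTN + A-Bruijn graphs
        [script, "A", config, reads_1, reads_2],  # align via BWA
        [script, "DS", config],                   # depth serialization + SV detection
    ]


def run_steps(steps, dry_run=True):
    """Print each command; run it only when dry_run is False."""
    for argv in steps:
        print(" ".join(argv))
        if not dry_run:
            # check=True stops the pipeline at the first failing step.
            subprocess.run(argv, check=True)


if __name__ == "__main__":
    steps = build_steps(
        "../grasper.sh",
        "example_config.txt",
        "20Insertions_per_element_1TH_pIRS_20X_11_90_470_1.fq.gz",
        "20Insertions_per_element_1TH_pIRS_20X_11_90_470_2.fq.gz",
    )
    run_steps(steps, dry_run=True)
```

Since steps I and G only need to be run once per reference genome, a wrapper like this would normally skip them on later samples.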
------------------------
OUTPUT
------------------------
*.thread : A-Bruijn graphs threading information
*.depth : .depth file contains the serialization of depth arrays.
*.discordant.midsorted : midpoint-sorted SAM file containing only the discordant mappings
*.SV : this file contains the SV calls from GRASPER
-----------------------
.SV file
-----------------------
Two-breakpoint events (TRANSPOSITION or INVERSION) have 23 columns, whereas one-breakpoint events have only the first 13 columns.
*** COLUMNS ***
Column 1 : Event ( (I) means inverted )
Column 2 : event classifier (internal purpose)
Column 3/5/20/22 : These columns indicate #reads in cluster
Column 4/6/21/23 : These columns indicate # of instances these clusters can map on linear reference. Clusters on graph that are on repetitive paths will have numbers > 1 to indicate their multiplicities.
Column ( 7-8-9 / 10-11-12 / 14-15-16 / 17-18-19 ) : One triplet indicates 5'boundary-3'boundary-ClusteringDirection of a cluster of reads
Columns 3-4-7-8-9 describe a single cluster (the boundaries and direction are given by columns 7-8-9, and the #reads and multiplicity of this cluster are in columns 3-4).
Columns 5-6-10-11-12 describe a single cluster.
Columns 20-21-14-15-16 describe a single cluster.
Columns 22-23-17-18-19 describe a single cluster.
Clusters that cannot be assigned to a specific event are appended at the end under the "# UNASSIGNED CLUSTERS" section.
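The column grouping above can be sketched in Python as follows. This is an illustration, not GRASPER code. It assumes the .SV file is tab-delimited, which this README does not state, and the function and key names are hypothetical. Values are kept as strings because their exact format is not documented here.

```python
# 1-based column numbers for each cluster, in the order:
# (#reads, multiplicity, 5' boundary, 3' boundary, clustering direction)
CLUSTER_COLUMNS = [
    (3, 4, 7, 8, 9),
    (5, 6, 10, 11, 12),
    (20, 21, 14, 15, 16),
    (22, 23, 17, 18, 19),
]
FIELD_NAMES = ("reads", "multiplicity", "boundary_5p", "boundary_3p", "direction")


def parse_sv_line(line, sep="\t"):
    """Split one .SV line into its event label and its clusters.

    ASSUMPTION: columns are separated by tabs. Clusters whose columns are
    missing (one-breakpoint events with 13 columns) are skipped.
    """
    fields = line.rstrip("\n").split(sep)
    if len(fields) < 2:
        raise ValueError("expected at least the event and classifier columns")
    clusters = []
    for cols in CLUSTER_COLUMNS:
        if max(cols) <= len(fields):
            values = [fields[c - 1] for c in cols]
            clusters.append(dict(zip(FIELD_NAMES, values)))
    return {"event": fields[0], "classifier": fields[1], "clusters": clusters}


if __name__ == "__main__":
    # Placeholder line where column N holds the text "cN".
    demo = "\t".join("c%d" % n for n in range(1, 24))
    for cluster in parse_sv_line(demo)["clusters"]:
        print(cluster)
```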
*** EVENT BOUNDARIES ***
1) Deletion: Deletion boundaries are roughly defined by [column8, column10] (Direction of clusters: ---> <---)
2) Inversion: Inversion boundaries are roughly defined by [column8/column10, column15/column16]
3) Transposition: The segment that is being transposed is roughly defined by [column3, column7] (<--- --->), and it is being transposed to the target location, roughly around column15/column16 (---> <---). The midpoint of column15 and column16 is probably a reasonable guess.
4) Tandem duplication: The segment that is being tandemly duplicated is roughly defined by [column7, column11] (Direction: <--- --->)
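As a rough illustration of the boundary rules above, the Python sketch below pulls the rough deletion and tandem-duplication intervals and the transposition target midpoint out of an already-split .SV line. The helper names are hypothetical, and it assumes the boundary columns hold integer coordinates, which this README does not state. The results are only as approximate as the rules themselves.

```python
def col(fields, n):
    """Return 1-based column n as an integer coordinate (assumed format)."""
    return int(fields[n - 1])


def deletion_interval(fields):
    """Rough deletion boundaries: [column8, column10]."""
    return col(fields, 8), col(fields, 10)


def tandem_duplication_interval(fields):
    """Rough tandemly duplicated segment: [column7, column11]."""
    return col(fields, 7), col(fields, 11)


def transposition_target(fields):
    """Guess the target location as the midpoint of column15 and column16."""
    return (col(fields, 15) + col(fields, 16)) / 2


if __name__ == "__main__":
    # Placeholder line where column N holds the value N * 100.
    demo = [str(n * 100) for n in range(1, 24)]
    print(deletion_interval(demo))
    print(tandem_duplication_interval(demo))
    print(transposition_target(demo))
```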