This code is designed to take fastq files that are produced by the MiSeq and return integration sites, multihits, and chimeras. RData and fasta files are used for data persistance in place of a database, thus allowing massive parallelization and the LSF job submission system on the PMACS HPC
Analysis is started by having the user create the following directory structure:
primaryAnalysisDirectory
├── Data
│ ├── Undetermined_S0_L001_I1_001.fastq.gz
│ ├── Undetermined_S0_L001_R1_001.fastq.gz
│ └── Undetermined_S0_L001_R2_001.fastq.gz
├── processingParams.csv
└── sampleInfo.csv
-
Data/Undetermined_S0_L001_*_001.fastq.gzare the fastq files returned by the MiSeq (R1, R2, and I1) -
processingParams.csvcontains 'dryside' sample metadata:aliasis the human-readable sample descriptionqualityThreshold, badQualityBases, qualitySlidingWindow- After error-correcting and demultiplexing, intSiteCaller trims raw MiSeq reads based on Illumina-issued quality scores.
badQualityBasesis the number of bases belowqualityThreshold, using standard Illumina ASCII Q-score encoding (p.41-2), that can be observed in a window ofqualitySlidingWindowbefore the read is trimmed.
- After error-correcting and demultiplexing, intSiteCaller trims raw MiSeq reads based on Illumina-issued quality scores.
primeris the primer sequence as seen in MiSeq read 2ltrBitis the LTR sequence as seen in MiSeq read 2largeLTRFragis 43nt of the LTR sequence as seen from MiSeq read 1linkerCommonis 15nt of the linker sequence as seen from MiSeq read 2mingDNAis the minimum length of genomic DNA (in nt) that is allowed to be passed onto the alignment stepvectorSeqis a filepath (either absolute or relative to the primary analysis directory) to the vector sequence in fasta format -- it is encouraged to place the vector sequence directly in the primary analysis directory, although that is not a requirementminPctIdentis the minimum percent identity that a query sequence has to a putative alignment on the target sequence in order to be considered 'valid'maxAlignStartis the maximum number of nucleotides of the query sequence that do not match the putative target sequence before a matched nucleotide is seenmaxFragLengthis the maximum length of a properly-paired alignment (in nt)refGenomeis the reference genome to be used for alignment - this is passed in as a standard text string (ex. 'hg18', 'hg19', 'mm8', 'mm9')
-
sampleInfo.csvcontains 'wetside' sample metadata:aliasis the human-readable sample descriptionlinkerSequenceis the linker sequence as seen in MiSeq read 1. N's indicate the presence of a primerIDbcSeqis the barcode used during sample preparationgenderis either 'M' of 'F' for male/female, respectively
After creating the above directory structure, the following command is issued:
Rscript intSiteCaller.R
The rest of the processing is fully automated and shouldn't take more than 4 hours to process 1.5e7 raw reads.
intSiteCaller.R can handle the following options
-j,--jobID- Unique name by which to identify this intance of intSiteCaller [default: intSiteCallerJob]-c,--codeDir- Directory where intSiteCaller code is stored, can be relative or absolute [default: .]-p,--primaryAnalysisDir- Location of primary analysis directory, can be relative or absolute [default: .]-C,--cleanup- Remove temporary files upon successful execution of intSiteCaller-h,--help- Show the help message and exit
This code returns integration sites in two formats. allSites.RData is a GRanges object that contains a single record for each Illumina read. sites.final.RData is a GRanges object of dereplicated integration sites along with a tally of how many reads were seen for each site (irregardless of sonic breakpoint). The revmap column in sites.final.RData links unique reads from allSites to dereplicated sites in sites.final.
Multihits are stored in multihitData.RData which is a list of three items:
unclusteredMultihitsis aGRangesobject of raw multihits, directly fromproperlyPairedAlignmentsclusteredMultihitPositionsis aGRangesListobject of clustered solostart multihits with each includedGRangesobject representing one multihit clusterclusteredMultihitLengthsis alistobject containing dataframes of multihit 'fragment lengths' for use in sonicAbundance. Each list record, n, cooresponds to the nth cluster as defined inclusteredMultihitPositions. TheVar1column cooresponds to the 'length' and theFreqcolumn cooresponds to the number of times that fragment length was observed.
Chimeras are stored in chimeraData.RData which is a list that contains some basic chimera frequency statistics and a GRangesList object. Each GRanges object contains two records, one for the read1 alignment and another for the read2 alignment
PrimerIDs (if present in the linker sequence) are stored in primerIDData.RData. This file is a base R list containing a DNAStringSet and a BStringSet containing the sequences and quality scores of the primerIDs.
Processing statistics are returned in the stats.RData file. This file contains a single data.frame. Detail of the specific columns provided in this dataframe will be added later and can inferred from intSiteLogic.R.
This code is highly dependent on Bioconductor packages for processing DNA data and collapsing/expanding alignments.
The following R packages and their subsesequent dependencies are required for proper operation of intSiteCaller:
ShortReadhiReadsProcessorGenomicRangesrtracklayerBSgenomeargparseigraph- Any
BSgenome.*.UCSC.*package cooresponding to reference genomes specified inprocessingParams.csv
Specific versioning analysis has not yet been performed.
Additionally, BLAT code requires the availability of the blat and python command. blat is available on PMACS at /opt/software/blatSrc/v35/bin/x86_64, however is not included in the default PATH. This directory will need to be added to the user's default PATH in order to successfullly run BLAT alignments.
intSiteCaller confirms the presence of all dependancies and will throw an error if a dependancy is not met.
- Primary read trimming and integration site calling logic is contained in
intSiteLogic.R. - Branching and condensing of PMACS jobs is performed in
programFlow.R - Barcode error correcting logic is performed in
errorCorrectIndices/golay.pyas wrapped byerrorCorrectIndices/processGolay.py. - All code files are checked into the repository.
- Flowcharts will be added to graphically describe the flow of the overall program as well as the flow/logic of individual functions
A sample dataset is included for verification of integration site calling accuracy. The testCases directory contains a subdirectory, intSiteValidation, and a compressed folder, intSiteValidationOUTPUT.tar.gz. To analyze the test data, move intSiteValidation/* to an LSF envrionment and execute the code as described in the 'Usage' section. Upon successful execution, the directory should resemble intSiteValidationOUTPUT.tar.gz. Note that this subset of data contains samples with some borderline cases. For example, clone7 samples should all fail, and many of the clone1-clone4 samples should return no multihits or chimeras. The current implementation of the code handles these gracefully.