From 6ee600e1077b0cb98e04b7e8c2988f09fa96cd55 Mon Sep 17 00:00:00 2001 From: Matthew Taliaferro Date: Tue, 9 Nov 2021 15:57:30 -0700 Subject: [PATCH 001/108] add README --- README.md | 68 +++++++++++++++++++++++++++++++++++++++++++++++++++++-- 1 file changed, 66 insertions(+), 2 deletions(-) diff --git a/README.md b/README.md index c824098..f09e4c0 100644 --- a/README.md +++ b/README.md @@ -1,2 +1,66 @@ -# OINC-seq -Detecting oxidative marks on RNA through high-throughput sequencing +# OINC-seq

Detecting oxidative marks on RNA using high-throughput sequencing + +## Overview + +OINC-seq (Oxidation-Induced Nucleotide Conversion sequencing) is a sequencing technology that allows the detection of oxidative marks on RNA molecules. Because guanosine has the lowest redox potential of any of the ribonucleosides, it is the one most likely to be affected by oxidation. When this occurs, guanosine is turned into 8-oxoguanosine (8-OG). A previous [study](https://pubs.acs.org/doi/10.1021/acs.biochem.7b00730) found that when reverse transcriptase encounters guanosine oxidation products, it can misinterpret 8-OG as either T or C. Therefore, to detect these oxidative marks, one can look for G -> T and G -> C conversions in RNAseq data. + +To detect and quantify these conversions, we have created software called **PIGPEN** (Pipeline for Identification of Guanosine Positions Erroneously Notated). + +PIGPEN takes in alignment files (bam), ideally made with [STAR](https://github.com/alexdobin/STAR). Single and paired-end reads are supported, although paired-end reads are preferred (for reasons that will become clear later). To minimize the contribution of positions that appear as mutations due to non-ideal alignments, PIGPEN only considers uniquely aligned reads (mapping quality == 255). For now, it is required that paired-end reads be stranded, and that read 1 correspond to the sense strand. This is true for most, but not all, modern RNAseq library preparation protocols. + +## Requirements + +PIGPEN has the following prerequisites: +- python >= 3.6 +- samtools >= 1.13 +- varscan >= 2.4.4 +- bcftools >= 1.13 +- pybedtools >= 0.8.2 +- pysam >= 0.16 +- numpy >= 1.21 +- pandas >= 1.3.3 +- bamtools >= 2.5.1 +- bedtools >= 2.30.0 + +## Installation + +For now, installation can be done by cloning this repository. As PIGPEN matures, we will work towards getting this package on [bioconda](https://bioconda.github.io/). 
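The G -> T / G -> C counting idea described in the Overview can be sketched in a few lines of Python. This is purely illustrative — the function and variable names here are hypothetical and not part of PIGPEN, which works from BAM alignments rather than raw sequence strings:

```python
from collections import Counter

def count_conversions(ref_seq, read_seq):
    """Tally ref->read nucleotide conversions between a reference
    sequence and an aligned read sequence (gapless toy alignment)."""
    convs = Counter()
    for ref_nt, read_nt in zip(ref_seq.lower(), read_seq.lower()):
        convs[f'{ref_nt}_{read_nt}'] += 1
    return convs

# An 8-OG read through by reverse transcriptase as T or C
# shows up in the tallies as g_t / g_c conversions.
convs = count_conversions('ATGGCG', 'ATGTCC')
print(convs['g_t'], convs['g_c'])  # 1 G->T and 1 G->C
```

In the real pipeline, the analogous comparison is done per aligned position from the BAM file, with SNP positions and low-quality base calls excluded before tallying.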
+ +## SNPs + +8-OG-induced conversions are rare, and this rarity makes it imperative that contributions from conversions that are not due to oxidation are minimized. A major source of apparent conversions is SNPs. It is therefore advantageous to find and mask SNPs in the data. + +PIGPEN does this by using [varscan](http://varscan.sourceforge.net/using-varscan.html) to find SNP positions. These locations are then excluded from all future analyses. Varscan parameters are controlled by the PIGPEN parameters `--SNPcoverage` and `--SNPfreq`, which control the depth and frequency required to call a SNP. We recommend being aggressive with these parameters. We often set them to 20 and 0.02, respectively. + +PIGPEN performs this SNP calling on control alignment files (`--controlBams`) in which the intended oxidation did not occur. PIGPEN will use the union of all SNPs found in these files for masking. Whether or not to call SNPs at all (you definitely should) is controlled by `--useSNPs`. + +This process can be time-consuming. At the end, a file called **merged.vcf** is created in the current working directory. If this file is present, PIGPEN will assume that it should be used for SNP masking, allowing the process of identifying SNPs to be skipped. + +## Filtering alignments + +Because the process of finding nucleotide conversions can take a long time, PIGPEN first filters the reads, keeping only those that overlap with any feature in a supplied bed file (`--geneBed`). This process can use multiple processors (`--nproc`) to speed it up, and requires another file (`--chromsizes`). This file is a two-column, tab-delimited text file in which column 1 contains the reference (chromosome) names present in the alignment file, and column 2 contains the integer size of each reference. If a fasta file and fasta index for the genome exist, this file can be made using `cut -f 1,2 genome.fa.fai`. + +## Quantifying conversions + +PIGPEN then identifies conversions in reads. 
This can be done using multiple processors (`--nproc`). In order to minimize the effect of sequencing error, PIGPEN only considers positions for which the sequencing quality was at least 30. There are two important flags to consider here. + +First, `--onlyConsiderOverlap` requires that the same conversion be observed in both reads of a mate pair. Positions interrogated by only one read are not considered. This can improve accuracy: true oxidation-induced conversions are rare enough that sequencing errors can cause a problem, and requiring that a conversion be present in both reads minimizes the effect of sequencing errors. Note, however, that if the fragment sizes for a library are especially large relative to the read length, the number of positions interrogated by both mates will be small. + +Second, `--requireMultipleConv` requires that there be at least two G -> C / G -> T conversions in a read pair in order for those conversions to be recorded. The rationale here is again to reduce the contribution of background, non-oxidation-related conversions. Background conversions should be distributed relatively randomly across reads. However, due to the spatial nature of the oxidation reaction, oxidation-induced conversions should be more clustered into specific reads. Therefore, requiring at least two conversions can increase specificity. In practice, this works well if the data is very deep or concentrated on a small number of targets. When dealing with transcriptome-scale data, this flag often reduces the number of observed conversions to an unacceptably low level. + +## Assigning reads to genes + +For now, PIGPEN uses `bedtools` and a supplied bed file of gene locations (`--geneBed`) to assign individual reads to genes. We are working on improvements in this area. + +## Calculating the number of conversions per gene + +After identifying the conversions present in each read and the cognate gene for each read, the number of conversions for each gene is calculated. 
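The per-gene tallying step just described can be sketched as follows. This is a minimal illustration with hypothetical names — PIGPEN's own implementation lives in `conversionsPerGene.py` (`getPerGene` / `writeConvsPerGene`):

```python
from collections import Counter, defaultdict

def conversions_per_gene(read_convs, read2gene):
    """Sum per-read conversion counters into per-gene totals.

    read_convs: {read_name: Counter of conversions, e.g. {'g_t': 1}}
    read2gene:  {read_name: gene_id}
    """
    per_gene = defaultdict(Counter)
    reads_per_gene = Counter()
    for read, read_counts in read_convs.items():
        gene = read2gene.get(read)
        if gene is None:  # read was not assigned to any gene
            continue
        per_gene[gene].update(read_counts)
        reads_per_gene[gene] += 1
    return reads_per_gene, per_gene

reads, convs = conversions_per_gene(
    {'r1': Counter({'g_t': 1}), 'r2': Counter({'g_c': 2}), 'r3': Counter()},
    {'r1': 'geneA', 'r2': 'geneA', 'r3': 'geneB'})
print(reads['geneA'], convs['geneA']['g_t'] + convs['geneA']['g_c'])  # 2 3
```

In the real output, all conversion classes (a_t, c_g, etc.) are carried along per gene, not just the G -> T and G -> C counts shown here.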
We have observed that the overall rate of conversions (not just G -> T + G -> C, but all conversions) can vary significantly from sample to sample, presumably due to a technical effect in library preparation. For this reason, PIGPEN calculates **PORC** (Proportion of Relevant Conversions) values. This is the log2 ratio of the relevant conversion rate ([G -> T + G -> C] / total number of reference G encountered) to the overall conversion rate (total number of all conversions / total number of positions interrogated). PORC therefore normalizes to the overall rate of conversions, removing this technical effect. + +## Statistical framework for comparing gene-level PORC values across conditions + +We are working on this. + + + + + From df7207efef6f0be1e7dafaf0e68f36a936751401 Mon Sep 17 00:00:00 2001 From: Matthew Taliaferro Date: Wed, 10 Nov 2021 10:49:44 -0700 Subject: [PATCH 002/108] add ascii art --- README.md | 14 ++++++++++++++ pigpen.py | 10 +++++++++- 2 files changed, 23 insertions(+), 1 deletion(-) diff --git a/README.md b/README.md index f09e4c0..981d9ac 100644 --- a/README.md +++ b/README.md @@ -8,6 +8,20 @@ To detect and quantify these conversions, we have created software called **PIGP PIGPEN takes in alignment files (bam), ideally made with [STAR](https://github.com/alexdobin/STAR). Single and paired-end reads are supported, although paired-end reads are preferred (for reasons that will become clear later). To minimize the contribution of positions that appear as mutations due to non-ideal alignments, PIGPEN only considers uniquely aligned reads (mapping quality == 255). For now, it is required that paired-end reads be stranded, and that read 1 correspond to the sense strand. This is true for most, but not all, modern RNAseq library preparation protocols. + ,-,-----, + PIGPEN **** \ \ ),)`-' + <`--'> \ \` + /. . `-----, + OINC! 
> ('') , @~ + `-._, ___ / +-|-|-|-|-|-|-|-| (( / (( / -|-|-| +|-|-|-|-|-|-|-|- ''' ''' -|-|-|- +-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-| + + Pipeline for Identification + Of Guanosine Positions + Erroneously Notated + ## Requirements PIGPEN has the following prerequisites: diff --git a/pigpen.py b/pigpen.py index c222157..010f0d4 100644 --- a/pigpen.py +++ b/pigpen.py @@ -9,9 +9,10 @@ from getmismatches import iteratereads_pairedend, getmismatches from assignreads import getReadOverlaps, processOverlaps from conversionsPerGene import getPerGene, writeConvsPerGene +import pickle if __name__ == '__main__': - parser = argparse.ArgumentParser(description = 'pigpen for quantifying OINC-seq data') + parser = argparse.ArgumentParser(description=' ,-,-----,\n PIGPEN **** \\ \\ ),)`-\'\n <`--\'> \\ \\` \n /. . `-----,\n OINC! > (\'\') , @~\n `-._, ___ /\n-|-|-|-|-|-|-|-| (( / (( / -|-|-| \n|-|-|-|-|-|-|-|- \'\'\' \'\'\' -|-|-|-\n-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|\n\n Pipeline for Identification \n Of Guanosine Positions\n Erroneously Notated', formatter_class = argparse.RawDescriptionHelpFormatter) parser.add_argument('--bam', type = str, help = 'Aligned reads (ideally STAR uniquely aligned reads) to quantify', required = True) parser.add_argument('--controlBams', type = str, help = 'Comma separated list of alignments from control samples (i.e. those where no *induced* conversions are expected. 
Required if SNPs are to be considered.') @@ -74,6 +75,13 @@ overlaps, numpairs = getReadOverlaps(filteredbam, args.geneBed, args.chromsizes) read2gene = processOverlaps(overlaps, numpairs) + #TESTING FOR SUBSAMPLING READS TO GET P VALUES + samplename = os.path.basename(args.bam) + with open(samplename + '.read2gene.pkl', 'wb') as outfh: + pickle.dump(read2gene, outfh) + with open(samplename + '.readconvs.pkl', 'wb') as outfh: + pickle.dump(convs, outfh) + #Calculate number of conversions per gene numreadspergene, convsPerGene = getPerGene(convs, read2gene) writeConvsPerGene(numreadspergene, convsPerGene, args.output) From 5674e8dc4144758bcb62f0a54a1afe1a5ce20a66 Mon Sep 17 00:00:00 2001 From: Matthew Taliaferro Date: Wed, 10 Nov 2021 10:51:50 -0700 Subject: [PATCH 003/108] update ascii again --- README.md | 10 +++++----- 1 file changed, 5 insertions(+), 5 deletions(-) diff --git a/README.md b/README.md index 981d9ac..25d1082 100644 --- a/README.md +++ b/README.md @@ -14,13 +14,13 @@ PIGPEN takes in alignment files (bam), ideally made with [STAR](https://github.c /. . `-----, OINC! > ('') , @~ `-._, ___ / --|-|-|-|-|-|-|-| (( / (( / -|-|-| +\-|-|-|-|-|-|-|-| (( / (( / -|-|-| |-|-|-|-|-|-|-|- ''' ''' -|-|-|- --|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-| +\-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-| - Pipeline for Identification - Of Guanosine Positions - Erroneously Notated + Pipeline for Identification
+ Of Guanosine Positions
+ Erroneously Notated
## Requirements From b8b7ab1b326279bcef82c7e96554c5f3e9081ec9 Mon Sep 17 00:00:00 2001 From: Matthew Taliaferro Date: Wed, 10 Nov 2021 10:59:28 -0700 Subject: [PATCH 004/108] one more time for the ascii --- README.md | 38 +++++++++++++++++--------------------- 1 file changed, 17 insertions(+), 21 deletions(-) diff --git a/README.md b/README.md index 25d1082..ccf9699 100644 --- a/README.md +++ b/README.md @@ -4,27 +4,28 @@ OINC-seq (Oxidation-Induced Nucleotide Conversion sequencing) is a sequencing technology that allows the direction of oxidative marks on RNA molecules. Because guanosine has the lowest redox potential of any of the ribonucleosides, it is the one most likely to be affected by oxidation. When this occurs, guanosine is turned into 8-oxoguanosine (8-OG). A previous [study](https://pubs.acs.org/doi/10.1021/acs.biochem.7b00730) found that when reverse transcriptase encounters guanosine oxidation products, it can misinterpret 8-OG as either T or C. Therefore, to detect these oxidative marks, one can look for G -> T and G -> C conversions in RNAseq data. -To detect and quantify these conversions, we have created software called **PIGPEN** (Pipeline for Identification of Guanosine Positions Erroneously Notated). +To detect and quantify these conversions, we have created software called **PIGPEN** (Pipeline for Identification of Guanosine Positions Erroneously Notated). PIGPEN takes in alignment files (bam), ideally made with [STAR](https://github.com/alexdobin/STAR). Single and paired-end reads are supported, although paired-end reads are preferred (for reasons that will become clear later). To minimize the contribution of positions that appear as mutations due to non-ideal alignments, PIGPEN only considers uniquely aligned reads (mapping quality == 255). For now, it is required that paired-end reads be stranded, and that read 1 correspond to the sense strand. This is true for most, but not all, modern RNAseq library preparation protocols. 
- ,-,-----, - PIGPEN **** \ \ ),)`-' - <`--'> \ \` - /. . `-----, - OINC! > ('') , @~ - `-._, ___ / -\-|-|-|-|-|-|-|-| (( / (( / -|-|-| -|-|-|-|-|-|-|-|- ''' ''' -|-|-|- -\-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-| + ,-,-----, + PIGPEN **** \ \ ),)`-' + <`--'> \ \` + /. . `-----, + OINC! > ('') , @~ + `-._, ___ / + -|-|-|-|-|-|-|-| (( / (( / -|-|-| + |-|-|-|-|-|-|-|- ''' ''' -|-|-|- + -|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-| - Pipeline for Identification
- Of Guanosine Positions
- Erroneously Notated
+ Pipeline for Identification + Of Guanosine Positions + Erroneously Notated ## Requirements PIGPEN has the following prerequisites: + - python >= 3.6 - samtools >= 1.13 - varscan >= 2.4.4 @@ -38,15 +39,15 @@ PIGPEN has the following prerequisites: ## Installation -For now, installation can be done by cloning this repository. As PIPGEN matures, we will work towards getting this package on [bioconda](https://bioconda.github.io/). +For now, installation can be done by cloning this repository. As PIPGEN matures, we will work towards getting this package on [bioconda](https://bioconda.github.io/). ## SNPs -8-OG-induced conversions are rare, and this rarity makes it imperative that contributions from conversions that are not due to oxidation are minimized. A major source of apparent conversions is SNPs. It is therefore advantageous to find and mask SNPs in the data. +8-OG-induced conversions are rare, and this rarity makes it imperative that contributions from conversions that are not due to oxidation are minimized. A major source of apparent conversions is SNPs. It is therefore advantageous to find and mask SNPs in the data. PIGPEN performs this by using [varscan](http://varscan.sourceforge.net/using-varscan.html) to find SNP positions. These locations are then excluded from all future analyses. Varscan parameters are controled by the PIGPEN parameters `--SNPcoverage` and `--SNPfreq` that control the depth and frequency required to call a SNP. We recommend being aggressive with these parameters. We often set them to 20 and 0.02, respectively. -PIGPEN performs this SNP calling on control alignment files (`--controlBams`) in which the intended oxidation did not occur. PIGPEN will use the union of all SNPs found in these files for masking. Whether or not to call SNPs at all (you definitely should) is controlled by `--useSNPs`. +PIGPEN performs this SNP calling on control alignment files (`--controlBams`) in which the intended oxidation did not occur. 
PIGPEN will use the union of all SNPs found in these files for masking. Whether or not to call SNPs at all (you definitely should) is controlled by `--useSNPs`. This process can be time consuming. At the end, a file called **merged.vcf** is created in the current working directory. If this file is present, PIGPEN will assume that it should be used for SNP masking, allowing the process of identifying SNPs to be skipped. @@ -73,8 +74,3 @@ After identifying the conversions present in each read and the cognate gene for ## Statistical framework for comparing gene-level PORC values across conditions We are working on this. - - - - - From 9e668ad256b2d9d5d980c0daac30864678b94f3e Mon Sep 17 00:00:00 2001 From: Matthew Taliaferro Date: Wed, 12 Jan 2022 08:49:47 -0700 Subject: [PATCH 005/108] add look for g_t or g_c independently --- conversionsPerGene.py | 13 +++++++++++-- getmismatches.py | 43 ++++++++++++++++++++++++++++++------------- pigpen.py | 24 ++++++++++++------------ 3 files changed, 53 insertions(+), 27 deletions(-) diff --git a/conversionsPerGene.py b/conversionsPerGene.py index e914b59..493c1cd 100644 --- a/conversionsPerGene.py +++ b/conversionsPerGene.py @@ -51,7 +51,7 @@ def getPerGene(convs, reads2gene): return numreadspergene, convsPerGene -def writeConvsPerGene(numreadspergene, convsPerGene, outfile): +def writeConvsPerGene(numreadspergene, convsPerGene, outfile, use_g_t, use_g_c): possibleconvs = [ 'a_a', 'a_t', 'a_c', 'a_g', 'a_n', 'g_a', 'g_t', 'g_c', 'g_g', 'g_n', @@ -75,7 +75,16 @@ def writeConvsPerGene(numreadspergene, convsPerGene, outfile): convcounts = [str(x) for x in convcounts] totalG = c['g_g'] + c['g_c'] + c['g_t'] + c['g_a'] + c['g_n'] - convG = c['g_c'] + c['g_t'] + if use_g_t and use_g_c: + convG = c['g_c'] + c['g_t'] + elif use_g_c and not use_g_t: + convG = c['g_c'] + elif use_g_t and not use_g_c: + convG = c['g_t'] + elif not use_g_t and not use_g_c: + print('ERROR: we have to be counting either G->T or G->C, if not both!') + 
sys.exit() + g_ccount = c['g_c'] g_tcount = c['g_t'] diff --git a/getmismatches.py b/getmismatches.py index a792275..f7cd595 100644 --- a/getmismatches.py +++ b/getmismatches.py @@ -51,7 +51,7 @@ def iteratereads_singleend(bam, snps = None): #Check mapping quality #For nextgenmap, max mapq is 60 - if read.mapping_quality < 50: + if read.mapping_quality < 255: continue queryname = read.query_name @@ -175,7 +175,8 @@ def findsnps(controlbams, genomefasta, minCoverage = 20, minVarFreq = 0.02): return snps -def iteratereads_pairedend(bam, onlyConsiderOverlap, snps = None, requireMultipleConv = False, verbosity = 'high'): + +def iteratereads_pairedend(bam, onlyConsiderOverlap, use_g_t, use_g_c, snps=None, requireMultipleConv=False, verbosity='high'): #Iterate over reads in a paired end alignment file. #Find nt conversion locations for each read. #For locations interrogated by both mates of read pair, conversion must exist in both mates in order to count @@ -241,7 +242,7 @@ def iteratereads_pairedend(bam, onlyConsiderOverlap, snps = None, requireMultipl read1qualities = list(read1.query_qualities) #phred scores read2qualities = list(read2.query_qualities) - convs_in_read = getmismatches_pairedend(read1alignedpairs, read2alignedpairs, read1queryseq, read2queryseq, read1qualities, read2qualities, read1strand, read2strand, snplocations, onlyConsiderOverlap, requireMultipleConv) + convs_in_read = getmismatches_pairedend(read1alignedpairs, read2alignedpairs, read1queryseq, read2queryseq, read1qualities, read2qualities, read1strand, read2strand, snplocations, onlyConsiderOverlap, requireMultipleConv, use_g_t, use_g_c) queriednts.append(sum(convs_in_read.values())) convs[queryname] = convs_in_read @@ -253,7 +254,7 @@ def iteratereads_pairedend(bam, onlyConsiderOverlap, snps = None, requireMultipl return convs, readcounter -def getmismatches_pairedend(read1alignedpairs, read2alignedpairs, read1queryseq, read2queryseq, read1qualities, read2qualities, read1strand, read2strand, 
snplocations, onlyoverlap, requireMultipleConv): +def getmismatches_pairedend(read1alignedpairs, read2alignedpairs, read1queryseq, read2queryseq, read1qualities, read2qualities, read1strand, read2strand, snplocations, onlyoverlap, requireMultipleConv, use_g_t, use_g_c): #remove tuples that have None #These are either intronic or might have been soft-clipped #Tuples are (querypos, refpos, refsequence) @@ -413,12 +414,28 @@ def getmismatches_pairedend(read1alignedpairs, read2alignedpairs, read1queryseq, #if we are requiring there be multiple g_t or g_c, implement that here if requireMultipleConv: - if convs['g_t'] + convs['g_c'] >= 2: - pass - elif convs['g_t'] + convs['g_c'] < 2: - convs['g_t'] = 0 - convs['g_c'] = 0 - + if use_g_t and use_g_c: + if convs['g_t'] + convs['g_c'] >= 2: + pass + elif convs['g_t'] + convs['g_c'] < 2: + convs['g_t'] = 0 + convs['g_c'] = 0 + elif use_g_t and not use_g_c: + if convs['g_t'] >= 2: + pass + elif convs['g_t'] < 2: + convs['g_t'] = 0 + convs['g_c'] = 0 + elif use_g_c and not use_g_t: + if convs['g_c'] >= 2: + pass + elif convs['g_c'] < 2: + convs['g_c'] = 0 + convs['g_t'] = 0 + elif not use_g_t and not use_g_c: + print('ERROR: we have to be looking for at least either G->T or G->C if not both!!') + sys.exit() + return convs @@ -534,7 +551,7 @@ def split_bam(bam, nproc): return splitbams -def getmismatches(bam, onlyConsiderOverlap, snps, requireMultipleConv, nproc): +def getmismatches(bam, onlyConsiderOverlap, snps, requireMultipleConv, nproc, use_g_t, use_g_c): #Actually run the mismatch code (calling iteratereads_pairedend) #use multiprocessing #If there's only one processor, easier to use iteratereads_pairedend() directly. 
@@ -544,7 +561,7 @@ def getmismatches(bam, onlyConsiderOverlap, snps, requireMultipleConv, nproc): splitbams = split_bam(bam, int(nproc)) argslist = [] for x in splitbams: - argslist.append((x, bool(onlyConsiderOverlap), snps, bool(requireMultipleConv), 'low')) + argslist.append((x, bool(onlyConsiderOverlap), bool(use_g_t), bool(use_g_c), snps, bool(requireMultipleConv), 'low')) #items returned from iteratereads_pairedend are in a list, one per process totalreadcounter = 0 #number of reads across all the split bams @@ -591,4 +608,4 @@ def getmismatches(bam, onlyConsiderOverlap, snps, requireMultipleConv, nproc): numreadspergene, convsPerGene = getPerGene(convs, read2gene) writeConvsPerGene(numreadspergene, convsPerGene, sys.argv[4]) #output - \ No newline at end of file + diff --git a/pigpen.py b/pigpen.py index 010f0d4..e0d5ed1 100644 --- a/pigpen.py +++ b/pigpen.py @@ -4,6 +4,7 @@ import argparse import subprocess import os +import sys from snps import getSNPs, recordSNPs from filterbam import intersectreads, filterbam, intersectreads_multiprocess from getmismatches import iteratereads_pairedend, getmismatches @@ -13,8 +14,7 @@ if __name__ == '__main__': parser = argparse.ArgumentParser(description=' ,-,-----,\n PIGPEN **** \\ \\ ),)`-\'\n <`--\'> \\ \\` \n /. . `-----,\n OINC! > (\'\') , @~\n `-._, ___ /\n-|-|-|-|-|-|-|-| (( / (( / -|-|-| \n|-|-|-|-|-|-|-|- \'\'\' \'\'\' -|-|-|-\n-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|\n\n Pipeline for Identification \n Of Guanosine Positions\n Erroneously Notated', formatter_class = argparse.RawDescriptionHelpFormatter) - parser.add_argument('--bam', type = str, help = 'Aligned reads (ideally STAR uniquely aligned reads) to quantify', - required = True) + parser.add_argument('--bam', type = str, help = 'Aligned reads (ideally STAR uniquely aligned reads) to quantify', required = True) parser.add_argument('--controlBams', type = str, help = 'Comma separated list of alignments from control samples (i.e. 
those where no *induced* conversions are expected. Required if SNPs are to be considered.') parser.add_argument('--genomeFasta', type = str, help = 'Genome sequence in fasta format. Required if SNPs are to be considered.') parser.add_argument('--geneBed', type = str, help = 'Bed file of genomic regions to quantify. Fourth field must be gene ID.') @@ -25,9 +25,16 @@ parser.add_argument('--SNPcoverage', type = int, help = 'Minimum coverage to call SNPs. Default = 20', default = 20) parser.add_argument('--SNPfreq', type = float, help = 'Minimum variant frequency to call SNPs. Default = 0.02', default = 0.02) parser.add_argument('--onlyConsiderOverlap', action = 'store_true', help = 'Only consider conversions seen in both reads of a read pair?') + parser.add_argument('--use_g_t', action = 'store_true', help = 'Consider G->T conversions?') + parser.add_argument('--use_g_c', action = 'store_true', help = 'Consider G->C conversions?') parser.add_argument('--requireMultipleConv', action = 'store_true', help = 'Only consider conversions seen in reads with multiple G->C + G->T conversions?') args = parser.parse_args() + #We have to be either looking for G->T or G->C, if not both + if not args.use_g_t and not args.use_g_c: + print('We have to either be looking for G->T or G->C, if not both! 
Add argument --use_g_t and/or --use_g_c.') + sys.exit() + #Make index for bam if there isn't one already bamindex = args.bam + '.bai' if not os.path.exists(bamindex): @@ -65,9 +72,9 @@ #Identify conversions if args.nproc == 1: - convs, readcounter = iteratereads_pairedend(filteredbam, args.onlyConsiderOverlap, snps, args.requireMultipleConv, 'high') + convs, readcounter = iteratereads_pairedend(filteredbam, args.onlyConsiderOverlap, args.use_g_t, args.use_g_c, snps, args.requireMultipleConv, 'high') elif args.nproc > 1: - convs = getmismatches(filteredbam, args.onlyConsiderOverlap, snps, args.requireMultipleConv, args.nproc) + convs = getmismatches(filteredbam, args.onlyConsiderOverlap, snps, args.requireMultipleConv, args.nproc, args.use_g_t, args.use_g_c) #Assign reads to genes @@ -75,16 +82,9 @@ overlaps, numpairs = getReadOverlaps(filteredbam, args.geneBed, args.chromsizes) read2gene = processOverlaps(overlaps, numpairs) - #TESTING FOR SUBSAMPLING READS TO GET P VALUES - samplename = os.path.basename(args.bam) - with open(samplename + '.read2gene.pkl', 'wb') as outfh: - pickle.dump(read2gene, outfh) - with open(samplename + '.readconvs.pkl', 'wb') as outfh: - pickle.dump(convs, outfh) - #Calculate number of conversions per gene numreadspergene, convsPerGene = getPerGene(convs, read2gene) - writeConvsPerGene(numreadspergene, convsPerGene, args.output) + writeConvsPerGene(numreadspergene, convsPerGene, args.output, args.use_g_t, args.use_g_c) From c9041612a59e3e7c57ec73021fac467e15ac1e6b Mon Sep 17 00:00:00 2001 From: Matthew Taliaferro Date: Tue, 10 May 2022 13:56:21 -0600 Subject: [PATCH 006/108] Add GLM for bacon --- bacon_glm.py | 328 +++++++++++++++++++++++++++++++++++++++++++++++++++ pigpen.py | 2 + 2 files changed, 330 insertions(+) create mode 100644 bacon_glm.py diff --git a/bacon_glm.py b/bacon_glm.py new file mode 100644 index 0000000..5b41ee6 --- /dev/null +++ b/bacon_glm.py @@ -0,0 +1,328 @@ +#Bioinformatic Analysis of the Conversion Of Nucleotides 
(BACON) + +#Use a linear model to identify genes whose rate of G-conversions changes across conditions +#Going to end up with a lot of contingency tables: + +# converted notConverted +#G +#nonG + +#G conversions are only allowed to be G->T and G->C +#Compare groups of contingency tables across conditions +import pandas as pd +import sys +import numpy as np +import itertools +from statsmodels.stats.multitest import multipletests +from rpy2.robjects.packages import importr +from rpy2.robjects import pandas2ri, Formula, FloatVector +from rpy2.rinterface_lib.embedded import RRuntimeError +from rpy2.rinterface import RRuntimeWarning +from rpy2.rinterface_lib.callbacks import logger as rpy2_logger +import logging +import warnings + +#Need r-base, r-stats, r-lme4 + +#supress RRuntimeWarnings +warnings.filterwarnings('ignore', category = RRuntimeWarning) +rpy2_logger.setLevel(logging.ERROR) #suppresses R messages to console + + + +def readconditions(samp_conds_file): + #Read in a three column file of oincoutputfile / sample / condition + #Return a pandas df + sampconddf = pd.read_csv(samp_conds_file, sep = '\t', index_col = False, header = 0) + + return sampconddf + +def makePORCdf(samp_conds_file, minreads): + #Make a dataframe of PORC values for all samples + #start with GENE...SAMPLE...READCOUNT...PORC + #then make a wide version that is GENE...SAMPLE1READCOUNT...SAMPLE1PORC...SAMPLE2READCOUNT...SAMPLE2PORC + + #Only keep genes that are present in EVERY pigpen file + + minreads = int(minreads) + dfs = [] #list of dfs from individual pigpen outputs + genesinall = set() #genes present in every pigpen file + with open(samp_conds_file, 'r') as infh: + for line in infh: + line = line.strip().split('\t') + if line[0] == 'file': + continue + pigpenfile = line[0] + sample = line[1] + df = pd.read_csv(pigpenfile, sep = '\t', index_col = False, header = 0) + dfgenes = df['Gene'].tolist() + samplecolumn = [sample] * len(dfgenes) + df = df.assign(sample = samplecolumn) + + if not 
genesinall: #if there are no genes in there (this is the first file) + genesinall = set(dfgenes) + else: + genesinall = genesinall.intersection(set(dfgenes)) + + columnstokeep = ['Gene', 'sample', 'numreads', 'porc'] + df = df[columnstokeep] + dfs.append(df) + + #for each df, filter keeping only the genes present in every df (genesinall) + #Somehow there are some genes whose name in NA + if np.nan in genesinall: + genesinall.remove(np.nan) + dfs = [df.loc[df['Gene'].isin(genesinall)] for df in dfs] + + #concatenate (rbind) dfs together + df = pd.concat(dfs) + + #turn from long into wide + df = df.pivot_table(index = 'Gene', columns = 'sample', values = ['numreads', 'porc']).reset_index() + #flatten multiindex column names + df.columns = ["_".join(a) if '' not in a else a[0] for a in df.columns.to_flat_index()] + + #Filter for genes with at least minreads in every sample + #get columns with numreads info + numreadsColumns = [col for col in df.columns if 'numreads' in col] + #Get minimum in those columns + df['minreadcount'] = df[numreadsColumns].min(axis = 1) + #Filter for rows with minreadcount >= minreads + print('Filtering for genes with at least {0} reads in every sample.'.format(minreads)) + df = df.loc[df['minreadcount'] >= minreads] + print('{0} genes have at least {1} reads in every sample.'.format(len(df), minreads)) + #We also don't want rows with inf/-inf PORC values + df = df.replace([np.inf, -np.inf], np.nan) + df = df.dropna(how= 'any') + #Return a dataframe of just genes and PORC values + columnstokeep = ['Gene'] + [col for col in df.columns if 'porc' in col] + df = df[columnstokeep] + + return df + +def calcDeltaPORC(porcdf, sampconds, conditionA, conditionB): + #Given a porc df from makePORCdf, add deltaporc values. 
+ + deltaporcs = [] + sampconddf = readconditions(sampconds) + + #Get column names in porcdf that are associated with each condition + conditionAsamps = sampconddf.loc[sampconddf['condition'] == conditionA] + conditionAsamps = conditionAsamps['sample'].tolist() + conditionAcolumns = ['porc_' + samp for samp in conditionAsamps] + + conditionBsamps = sampconddf.loc[sampconddf['condition'] == conditionB] + conditionBsamps = conditionBsamps['sample'].tolist() + conditionBcolumns = ['porc_' + samp for samp in conditionBsamps] + + print('Condition A samples: ' + (', ').join(conditionAsamps)) + print('Condition B samples: ' + (', ').join(conditionBsamps)) + + for index, row in porcdf.iterrows(): + condAporcs = [] + condBporcs = [] + for col in conditionAcolumns: + porc = row[col] + condAporcs.append(porc) + for col in conditionBcolumns: + porc = row[col] + condBporcs.append(porc) + + condAporcs = [x for x in condAporcs if x != np.nan] + condBporcs = [x for x in condBporcs if x != np.nan] + condAporc = np.mean(condAporcs) + condBporc = np.mean(condBporcs) + deltaporc = condBporc - condAporc + deltaporc = float(format(deltaporc, '.3f')) + deltaporcs.append(deltaporc) + + porcdf = porcdf.assign(deltaPORC = deltaporcs) + + return porcdf + +def makeContingencyTable(row): + #Given a row from a pigpen df, return a contingency table of the form + #[[convG, nonconvG], [convnonG, nonconvnonG]] + + convG = row['g_t'] + row['g_c'] + nonconvG = row['g_g'] + convnonG = row['a_t'] + row['a_c'] + row['a_g'] + row['c_a'] + row['c_t'] + row['c_g'] + row['t_a'] + row['t_c'] + row['t_g'] + nonconvnonG = row['c_c'] + row['t_t'] + row['a_a'] + + conttable = [[convG, nonconvG], [convnonG, nonconvnonG]] + + return conttable + + +def calculate_nested_f_statistic(small_model, big_model): + #Given two fitted GLMs, the larger of which contains the parameter space of the smaller, return the F Stat and P value corresponding to the larger model adding explanatory power + #No anova test for GLM models 
in python + #this is a workaround + #https://stackoverflow.com/questions/27328623/anova-test-for-glm-in-python + addtl_params = big_model.df_model - small_model.df_model + f_stat = (small_model.deviance - big_model.deviance) / \ + (addtl_params * big_model.scale) + df_numerator = addtl_params + # use fitted values to obtain n_obs from model object: + df_denom = (big_model.fittedvalues.shape[0] - big_model.df_model) + p_value = stats.f.sf(f_stat, df_numerator, df_denom) + return (f_stat, p_value) + +def getgenep(geneconttable): + #Given a gene-level contingency table of the form: + #[condAtables, condBtables], where each individual sample is of the form + #[[convG, nonconvG], [convnonG, nonconvnonG]], + #run glm either including or excluding condition term + #using likelihood ratio of the two models and chisq test, return p value + + #Turn gene-level contingency tables into df of form + #convcount nonconvcount condition nuc sample + nCondAsamples = len(geneconttable[0]) + nCondBsamples = len(geneconttable[1]) + + # e.g. ['condA', 'condA', 'condB', 'condB'] + cond = ['condA'] * (nCondAsamples * 4) + ['condB'] * (nCondBsamples * 4) + nuc = ['G', 'G', 'nonG', 'nonG'] * nCondAsamples + ['G', 'G', 'nonG', + 'nonG'] * nCondBsamples # e.g. ['G', 'G', 'nonG', 'nonG', ...] + conv = ['yes', 'no', 'yes', 'no'] * nCondAsamples + ['yes', 'no', + 'yes', 'no'] * nCondBsamples # e.g. ['yes', 'no', 'yes', 'no', ...] 
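As a sanity check on the list bookkeeping above, here is a toy run of the flattening step with one hypothetical sample per condition (made-up counts; the real function also tracks a `sample` column, omitted here for brevity):

```python
import itertools
import pandas as pd

# Hypothetical gene-level input: [[condA sample tables], [condB sample tables]],
# where each sample table is [[convG, nonconvG], [convnonG, nonconvnonG]]
geneconttable = [
    [[[5, 100], [20, 4000]]],   # condition A, one sample
    [[[15, 90], [22, 3900]]],   # condition B, one sample
]

cond = ['condA'] * 4 + ['condB'] * 4
nuc = ['G', 'G', 'nonG', 'nonG'] * 2
conv = ['yes', 'no', 'yes', 'no'] * 2

# Three levels of nesting -> flatten three times to get one count per row
a = list(itertools.chain.from_iterable(geneconttable))
b = list(itertools.chain.from_iterable(a))
counts = list(itertools.chain.from_iterable(b))

df = pd.DataFrame({'cond': cond, 'nuc': nuc, 'conv': conv, 'counts': counts})
print(counts)  # [5, 100, 20, 4000, 15, 90, 22, 3900]
```

Each count lines up with its (condition, nucleotide, converted?) label, which is what the downstream pivot to `yes`/`no` columns relies on.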
+ samples = ['sample' + str(x + 1) for x in range(nCondAsamples + nCondBsamples)] + samples = list(itertools.repeat(samples, 4)) + samples = list(itertools.chain.from_iterable(samples)) + samples = sorted(samples) + samplenumber = [x + 1 for x in range(len(cond))] + + #to get counts, flatten the very nested list of lists that is geneconttable + a = list(itertools.chain.from_iterable(geneconttable)) + b = list(itertools.chain.from_iterable(a)) + counts = list(itertools.chain.from_iterable(b)) + + d = {'cond': cond, 'nuc': nuc, 'conv': conv, + 'counts': counts, 'sample' : samples} + df = pd.DataFrame.from_dict(d) + #Reshape table to get individual columns for converted and nonconverted nts + df2 = df.pivot_table(index = ['cond', 'nuc', 'sample'], columns = 'conv', values = 'counts').reset_index() + + pandas2ri.activate() + + fmla = 'cbind(yes, no) ~ nuc + cond + nuc:cond + (1 | sample)' + nullfmla = 'cbind(yes, no) ~ nuc + cond + (1 | sample)' + + fullfit = lme4.glmer(formula=fmla, family=stats.binomial, data=df2) + reducedfit = lme4.glmer(formula=nullfmla, family=stats.binomial, data=df2) + + logratio = (stats.logLik(fullfit)[0] - stats.logLik(reducedfit)[0]) * 2 + pvalue = stats.pchisq(logratio, df=2, lower_tail=False)[0] + #format decimal + pvalue = float('{:.2e}'.format(pvalue)) + + return pvalue + +def multihyp(pvalues): + #given a dictionary of {gene : pvalue}, perform multiple hypothesis correction + + #remove genes with p value of NA + cleanedp = {} + for gene in pvalues: + if not np.isnan(pvalues[gene]): + cleanedp[gene] = pvalues[gene] + + cleanedps = list(cleanedp.values()) + fdrs = multipletests(cleanedps, method = 'fdr_bh')[1] + fdrs = dict(zip(list(cleanedp.keys()), fdrs)) + + correctedps = {} + for gene in pvalues: + if np.isnan(pvalues[gene]): + correctedps[gene] = np.nan + else: + correctedps[gene] = fdrs[gene] + + return correctedps + + +def getpvalues(samp_conds_file, conditionA, conditionB): + #each contingency table will be: [[convG, nonconvG], 
[convnonG, nonconvnonG]] + #These will be stored in a dictionary: {gene : [condAtables, condBtables]} + conttables = {} + + pvalues = {} #{gene : pvalue} + + nsamples = 0 + with open(samp_conds_file, 'r') as infh: + for line in infh: + line = line.strip().split('\t') + if line[0] == 'file': + continue + nsamples +=1 + pigpenfile = line[0] + sample = line[1] + condition = line[2] + df = pd.read_csv(pigpenfile, sep = '\t', index_col = False, header=0) + for idx, row in df.iterrows(): + conttable = makeContingencyTable(row) + gene = row['Gene'] + if gene not in conttables: + conttables[gene] = [[], []] + if condition == conditionA: + conttables[gene][0].append(conttable) + elif condition == conditionB: + conttables[gene][1].append(conttable) + + genecounter = 0 + for gene in conttables: + genecounter +=1 + if genecounter % 1000 == 0: + print('Getting p value for gene {0}...'.format(genecounter)) + geneconttable = conttables[gene] + #Only calculate p values for genes present in every porc file + gene_porcfiles = len(geneconttable[0]) + len(geneconttable[1]) + if nsamples == gene_porcfiles: + try: + p = getgenep(geneconttable) + except RRuntimeError: + p = np.nan + else: + p = np.nan + + pvalues[gene] = p + + correctedps = multihyp(pvalues) + + pdf = pd.DataFrame.from_dict(pvalues, orient = 'index', columns = ['pval']) + fdrdf = pd.DataFrame.from_dict(correctedps, orient = 'index', columns = ['FDR']) + + pdf = pd.merge(pdf, fdrdf, left_index = True, right_index = True).reset_index().rename(columns = {'index' : 'Gene'}) + + return pdf + +def formatporcDF(porcdf): + #Format floats in all porcDF columns + formats = {'deltaPORC': '{:.3f}', 'pval': '{:.3e}', 'FDR': '{:.3e}'} + c = porcdf.columns.tolist() + c = [x for x in c if 'porc_' in x] + for x in c: + formats[x] = '{:.3f}' #all porc_SAMPLE columns + for col, f in formats.items(): + porcdf[col] = porcdf[col].map(lambda x: f.format(x)) + + return porcdf + +if __name__ == '__main__': + utils = importr('utils') + lme4 = 
importr('lme4') + base = importr('base') + stats = importr('stats') + + #Make df of PORC values + porcdf = makePORCdf(sys.argv[1], 100) + #Add delta porc values + porcdf = calcDeltaPORC(porcdf, sys.argv[1], 'mDBF', 'pDBF') + #Get p values and corrected p values + pdf = getpvalues(sys.argv[1], 'mDBF', 'pDBF') + #Add p values and FDR + porcdf = pd.merge(porcdf, pdf, on = ['Gene']) + #Format floats + porcdf = formatporcDF(porcdf) + + porcdf.to_csv('porc.txt', sep='\t', index = False) diff --git a/pigpen.py b/pigpen.py index e0d5ed1..8462c2f 100644 --- a/pigpen.py +++ b/pigpen.py @@ -81,6 +81,8 @@ print('Assigning reads to genes...') overlaps, numpairs = getReadOverlaps(filteredbam, args.geneBed, args.chromsizes) read2gene = processOverlaps(overlaps, numpairs) + os.remove(filteredbam) + os.remove(filteredbam + '.bai') #Calculate number of conversions per gene numreadspergene, convsPerGene = getPerGene(convs, read2gene) From 356ecfc869861509916b8f5df2a26a8d3fc80a55 Mon Sep 17 00:00:00 2001 From: Matthew Taliaferro Date: Tue, 10 May 2022 13:58:23 -0600 Subject: [PATCH 007/108] Delete bacon.py delete old bacon GLM that just compared PORC values --- bacon.py | 250 ------------------------------------------------------- 1 file changed, 250 deletions(-) delete mode 100644 bacon.py diff --git a/bacon.py b/bacon.py deleted file mode 100644 index c394df2..0000000 --- a/bacon.py +++ /dev/null @@ -1,250 +0,0 @@ -#Bioinformatic Analysis of the Conversion Of Nucleotides (BACON) - -#Identify genes whose PORC values are different across conditions -#Two possibilities: GLM (readconditions, makePORCdf, getLMEp) -#or -#subsample reads, calculate many PORC values, using Hotelling T2 to identify genes with different PORC distributions across conditions - -import pandas as pd -import sys -import numpy as np -from collections import OrderedDict -import math -import statsmodels.api as sm -import statsmodels.formula.api as smf -from statsmodels.stats.multitest import multipletests -from 
scipy.stats.distributions import chi2 -import warnings -import time - - -def readconditions(samp_conds_file): - #Read in a three column file of oincoutputfile / sample / condition - #Return a pandas df - sampconddf = pd.read_csv(samp_conds_file, sep = '\t', index_col = False, header = 0) - - return sampconddf - -def makePORCdf(samp_conds_file, minreads): - #Make a dataframe of PORC values for all samples - #start with GENE...SAMPLE...READCOUNT...PORC - #then make a wide version that is GENE...SAMPLE1READCOUNT...SAMPLE1PORC...SAMPLE2READCOUNT...SAMPLE2PORC - - #Only keep genes that are present in EVERY pigpen file - - minreads = int(minreads) - dfs = [] #list of dfs from individual pigpen outputs - genesinall = set() #genes present in every pigpen file - with open(samp_conds_file, 'r') as infh: - for line in infh: - line = line.strip().split('\t') - if line[0] == 'file': - continue - pigpenfile = line[0] - sample = line[1] - df = pd.read_csv(pigpenfile, sep = '\t', index_col = False, header = 0) - dfgenes = df['Gene'].tolist() - samplecolumn = [sample] * len(dfgenes) - df = df.assign(sample = samplecolumn) - - if not genesinall: #if there are no genes in there (this is the first file) - genesinall = set(dfgenes) - else: - genesinall = genesinall.intersection(set(dfgenes)) - - columnstokeep = ['Gene', 'sample', 'numreads', 'porc'] - df = df[columnstokeep] - dfs.append(df) - - #for each df, filter keeping only the genes present in every df (genesinall) - #Somehow there are some genes whose name in NA - genesinall.remove(np.nan) - dfs = [df.loc[df['Gene'].isin(genesinall)] for df in dfs] - - #concatenate (rbind) dfs together - df = pd.concat(dfs) - - #turn from long into wide - df = df.pivot_table(index = 'Gene', columns = 'sample', values = ['numreads', 'porc']).reset_index() - #flatten multiindex column names - df.columns = ["_".join(a) if '' not in a else a[0] for a in df.columns.to_flat_index()] - - #Filter for genes with at least minreads in every sample - #get 
columns with numreads info - numreadsColumns = [col for col in df.columns if 'numreads' in col] - #Get minimum in those columns - df['minreadcount'] = df[numreadsColumns].min(axis = 1) - #Filter for rows with minreadcount >= minreads - print('Filtering for genes with at least {0} reads in every sample.'.format(minreads)) - df = df.loc[df['minreadcount'] >= minreads] - print('{0} genes have at least {1} reads in every sample.'.format(len(df), minreads)) - #We also don't want rows with inf/-inf PORC values - df = df.replace([np.inf, -np.inf], np.nan) - df = df.dropna(how= 'any') - #Return a dataframe of just genes and PORC values - columnstokeep = ['Gene'] + [col for col in df.columns if 'porc' in col] - df = df[columnstokeep] - - return df - -def getLMEp(sampconddf, porcdf, conditionA, conditionB): - #Calculate pvalues for genes based on PORC values across conditions using LME model - #Delta porc values are calculated as conditionB - condition A - - deltaporcdict = OrderedDict() #{genename : deltaporc} ordered so it's easy to match it up with porcdf - pvaluedict = OrderedDict() #{genename : pvalue} ordered so it's easy to match it up with q values - - #Get column names in porcdf that are associated with each condition - conditionAsamps = sampconddf.loc[sampconddf['condition'] == conditionA] - conditionAsamps = conditionAsamps['sample'].tolist() - conditionAcolumns = ['porc_' + samp for samp in conditionAsamps] - - conditionBsamps = sampconddf.loc[sampconddf['condition'] == conditionB] - conditionBsamps = conditionBsamps['sample'].tolist() - conditionBcolumns = ['porc_' + samp for samp in conditionBsamps] - - print('Condition A samples: ' + (', ').join(conditionAsamps)) - print('Condition B samples: ' + (', ').join(conditionBsamps)) - - #Store relationships of conditions and the samples in that condition - #It's important that this dictionary be ordered because we are going to be iterating through it - samp_conds = OrderedDict({'condA' : conditionAcolumns, 'condB' : 
conditionBcolumns}) - - #Get a list of all samples - samps = [] - for cond in samp_conds: - samps += samp_conds[cond] - - #Iterate through rows, making a dictionary from every row, turning it into a dataframe, then calculating p value - genecounter = 0 - for index, row in porcdf.iterrows(): - genecounter +=1 - if genecounter % 1000 == 0: - print('Calculating pvalue for gene {0}...'.format(genecounter)) - - d = {} - d['Gene'] = [row['Gene']] * len(samps) - d['variable'] = samps - - values = [] #porc values - for cond in samp_conds: - for sample in samp_conds[cond]: - value = row[sample] - values.append(value) - d['value'] = values - - #If there is an NA psi value, we are not going to calculate a pvalue for this gene - p = None - if True in np.isnan(values): - p = np.nan - - conds = [] - for cond in samp_conds: - conds += [cond] * len(samp_conds[cond]) - condAs = [] - condBs = [] - for cond in conds: - if cond == 'condA': - condAs.append(1) - condBs.append(0) - elif cond == 'condB': - condAs.append(0) - condBs.append(1) - d['condA'] = condAs #e.g. [1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0] - d['condB'] = condBs #e.g. [0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1] - - d['samples'] = [x + 1 for x in range(len(samps))] - - #Turn this dictionary into a DataFrame - rowdf = pd.DataFrame.from_dict(d) - - #delta psi is difference between mean psi of two conditions (cond2 - cond1) - condAmeanporc = float(format(np.mean(rowdf.query('condA == 1').value.dropna()), '.3f')) - condBmeanporc = float(format(np.mean(rowdf.query('condB == 1').value.dropna()), '.3f')) - deltaporc = condBmeanporc - condAmeanporc - deltaporc = float(format(deltaporc, '.3f')) - deltaporcdict[row['Gene']] = deltaporc - - #Get LME pvalue, but only if we haven't already determined that the pvalue is NA because we are missing one or more psi values - #Lots of warnings about convergence, etc. Suppress them. 
- if not p: - with warnings.catch_warnings(): - warnings.filterwarnings('ignore') - - #So apparently, some combinations of psi values will give nan p values due to a LinAlgError that arises from a singular - #hessian matrix during the fit of the model. However, the equivalent code in R (nlme::lme) never gives this error, even with - #the same data. It's not clear from just looking at the psi values why this is. However, I found that by varying the - #start_params in the fit, this can be avoided. If this is done, the resulting p value always matches what is given in R. - #Further, the p value is the same regardless of the start_param. - #But it's not clear to me why changing the start_param matters, or what the default is here or with nlme. - #So let's try a few starting paramters. Regardless, this seems to affect a small number of genes (<1%), and it is causing - #false negatives because genes that should get p values (may or may not be sig) are getting NA. - possible_start_params = [0, 0, 1, -1, 2, -2] - numberoftries = -1 - for param in possible_start_params: - #if we already have a pvalue, don't try again - if p != None and not np.isnan(p): - break - #First time through, numberoftries = 0, and we are just using a placeholder startparam (0) here because we aren't even using it. 
- #Gonna use whatever the default is - numberoftries += 1 - try: - #actual model - md = smf.mixedlm('value ~ condA', data=rowdf, groups='samples', missing='drop') - if numberoftries == 0: - # REML needs to be false in order to use log-likelihood for pvalue calculation - mdf = md.fit(reml=False) - elif numberoftries > 0: - mdf = md.fit(reml=False, start_params=[param]) - - #null model - nullmd = smf.mixedlm('value ~ 1', data=rowdf, groups='samples', missing='drop') - if numberoftries == 0: - nullmdf = nullmd.fit(reml=False) - elif numberoftries > 0: - nullmdf = nullmd.fit(reml=False, start_params=[param]) - - #Likelihood ratio - LR = 2 * (mdf.llf - nullmdf.llf) - p = chi2.sf(LR, df=1) - - #These exceptions are needed to catch cases where either all psi values are nan (valueerror) or all psi values for one condition are nan (linalgerror) - except (ValueError, np.linalg.LinAlgError): - p = np.nan - - pvaluedict[row['Gene']] = float('{:.2e}'.format(p)) - - #Correct pvalues using BH method, but only using pvalues that are not NA - pvalues = list(pvaluedict.values()) - pvaluesformultitest = [pval for pval in pvalues if str(pval) != 'nan'] - fdrs = list(multipletests(pvaluesformultitest, method='fdr_bh')[1]) - fdrs = [float('{:.2e}'.format(fdr)) for fdr in fdrs] - - #Now we need to incorporate the places where p = NA into the list of FDRs (also as NA) - fdrswithnas = [] - fdrindex = 0 - for pvalue in pvalues: - if str(pvalue) != 'nan': - fdrswithnas.append(fdrs[fdrindex]) - fdrindex += 1 - elif str(pvalue) == 'nan': - fdrswithnas.append(np.nan) - - #Add deltaporcs, pvalues, and FDRs to df - deltaporcs = list(deltaporcdict.values()) - porcdf = porcdf.assign(deltaporc=deltaporcs) - porcdf = porcdf.assign(pval=pvalues) - porcdf = porcdf.assign(FDR=fdrswithnas) - - #Write df - fn = 'porc.pval.txt' - porcdf.to_csv(fn, sep='\t', index=False, float_format='%.3g', na_rep='NA') - - - - -if __name__ == '__main__': - sampconddf = readconditions(sys.argv[1]) - #print(sampconddf) - 
porcdf = makePORCdf(sys.argv[1], sys.argv[2]) - getLMEp(sampconddf, porcdf, sys.argv[3], sys.argv[4]) From 2f432f32b554f0d943658d900f2a883722bd11da Mon Sep 17 00:00:00 2001 From: Matthew Taliaferro Date: Tue, 10 May 2022 13:59:31 -0600 Subject: [PATCH 008/108] Delete glm.py remove slamdunk GLM --- glm.py | 180 --------------------------------------------------------- 1 file changed, 180 deletions(-) delete mode 100644 glm.py diff --git a/glm.py b/glm.py deleted file mode 100644 index e4ab00f..0000000 --- a/glm.py +++ /dev/null @@ -1,180 +0,0 @@ -import pandas as pd -import sys -from functools import reduce -import statsmodels.api as sm -import statsmodels.formula.api as smf -from scipy.stats.distributions import chi2 -import numpy as np -from statsmodels.stats.multitest import multipletests -from collections import OrderedDict -from statsmodels.tools.sm_exceptions import PerfectSeparationError as PSE - - -def combinesamples(slamdunkouts, samplenames): - #Given a slew of slam dunk outputs, combine them together into one df - #slamdunkouts is a comma-separated list of filepaths - #samplenames is a comma-separated list of names, in the same order as slamdunkouts - #conditions is a comma-separated list of conditionIDs (max 2), in the same order of slamdunkouts - - dfs = [] - samplenames = samplenames.split(',') - #conditions = conditions.split(',') - for idx, sd in enumerate(slamdunkouts.split(',')): - #skip first 2 header lines - df = pd.read_csv(sd, sep = '\t', skiprows = 2, header = 0) - columns = list(df.columns) - columnstokeep = ['Name', 'G_G', 'G_C', 'G_T'] - columnstodrop = [c for c in columns if c not in columnstokeep] - df = df.drop(labels = columnstodrop, axis = 1) - #Combine G_C and G_T - df = df.assign(G_mut = df['G_C'] + df['G_T']) - #totalGs - df = df.assign(totalG = df['G_G'] + df['G_mut']) - - #rename columns in preparation for merging - columns = list(df.columns) - untouchablecolumnnames = ['Name'] - samplename = samplenames[idx] - columns = [c + ';' 
+ samplename if c not in untouchablecolumnnames else c for c in columns] - df.columns = columns - df = df.drop_duplicates(ignore_index = True) #somehow duplicate transcripts? - dfs.append(df) - - bigdf = reduce(lambda x, y: pd.merge(x, y, on = ['Name']), dfs) - bigdf = bigdf.drop_duplicates(ignore_index = True) - #Remove any row with NA value - bigdf = bigdf.dropna(axis = 0, how = 'any') - - return bigdf - -def classifysamples(samplenames, conditions): - #samplenames is a comma-separated list of names - #conditions is a comma-separated list of conditionIDs (max 2), in the same order of samplenames - samplenames = samplenames.split(',') - conditions = conditions.split(',') - d = dict(zip(samplenames, conditions)) - - return d - -def doglm(bigdf, sampconds): - genecounter = 0 - ps = OrderedDict() #{genename : p} - mintotalGs = OrderedDict() #{genename : min number of total Gs across all samples} - meantotalGs = OrderedDict() #{genename : mean number of total Gs across all samples} - condArates = OrderedDict() #{genename : condA mean mutation rate} - condBrates = OrderedDict() #{genename : condA mean mutation rate} - ratelog2fc = OrderedDict() #{genename : log2fc in mutation rates (B/A)} - #sampconds is dict of {samplename : condition} - for index, row in bigdf.iterrows(): - genecounter +=1 - if genecounter % 1000 == 0: - print('Gene {0}...'.format(genecounter)) - samples = list(sampconds.keys()) - G_Gs = [] - G_Cs = [] - G_Ts = [] - G_muts = [] - totalGs = [] - conds = [] - for sample in samples: - G_G = row['G_G;{0}'.format(sample)] - G_C = row['G_C;{0}'.format(sample)] - G_T = row['G_T;{0}'.format(sample)] - G_mut = row['G_mut;{0}'.format(sample)] - totalG = row['totalG;{0}'.format(sample)] - cond = sampconds[sample] - G_Gs.append(G_G) - G_Cs.append(G_C) - G_Ts.append(G_T) - G_muts.append(G_mut) - totalGs.append(totalG) - conds.append(cond) - - mintotalGs[row['Name']] = min(totalGs) - meantotalGs[row['Name']] = np.mean(totalGs) - d = {'G_G' : G_Gs, 'G_C' : G_Cs, 
'G_T' : G_Ts, 'G_mut' : G_muts, 'totalG' : totalGs, 'cond' : conds} - rowdf = pd.DataFrame.from_dict(d) - totalGs = rowdf['totalG'].tolist() - totalmuts = rowdf['G_mut'].tolist() - minG = min(totalGs) - minmut = min(totalmuts) - #Implement totalG count filter - if minG < 100: - p = np.nan - - else: - try: - #GLM - mod_real = smf.glm('G_mut + G_G ~ cond', family = sm.families.Binomial(), data = rowdf).fit() - mod_null = smf.glm('G_mut + G_G ~ 1', family = sm.families.Binomial(), data = rowdf).fit() - - #Likelihood ratio test - logratio = (mod_real.llf - mod_null.llf) * 2 - p = round(chi2.sf(logratio, df = 1), 4) - #If all mutation counts in one condition are 0, this causes a problem called Perfect Separation Error - #Interestingly, this is not triggered if there is only 1 replicate in a condition - except PSE: - p = np.nan - - - ps[row['Name']] = p - - #Calculate mean conversion rates for each condition and the difference between them - conds = sorted(list(set(conds))) #conds are alphabetically sorted. 
condA is the first one, condB is the second one - for idx, cond in enumerate(conds): - conddf = rowdf[rowdf['cond'] == cond] - rates = [] - for condi, condr in conddf.iterrows(): - try: - rate = condr['G_mut'] / condr['totalG'] - except ZeroDivisionError: - rate = np.nan - - rates.append(rate) - - meanrate = np.mean(rates) - if idx == 0: - condArates[row['Name']] = meanrate - condArate = meanrate - elif idx == 1: - condBrates[row['Name']] = meanrate - condBrate = meanrate - - pc = 1e-6 - log2fc = np.log2((condBrate + pc) / (condArate + pc)) - ratelog2fc[row['Name']] = log2fc - - #Correct pvalues using BH method, but only using pvalues that are not NA - pvalues = list(ps.values()) - pvaluesformultitest = [pval for pval in pvalues if str(pval) != 'nan'] - fdrs = list(multipletests(pvaluesformultitest, method = 'fdr_bh')[1]) - fdrs = [float('{:.2e}'.format(fdr)) for fdr in fdrs] - - #Now we need to incorporate the places where p = NA into the list of FDRs (also as NA) - fdrswithnas = [] - fdrindex = 0 - for pvalue in pvalues: - if str(pvalue) != 'nan': - fdrswithnas.append(fdrs[fdrindex]) - fdrindex +=1 - elif str(pvalue) == 'nan': - fdrswithnas.append(np.nan) - - genes = bigdf['Name'].tolist() - outd = {'Gene' : genes, 'minGcount' : list(mintotalGs.values()), 'meanGcount' : list(meantotalGs.values()), '{0}mutrate'.format(conds[0]) : list(condArates.values()), '{0}mutrate'.format(conds[1]) : list(condBrates.values()), 'log2fc' : list(ratelog2fc.values()), 'p' : list(ps.values()), 'FDR' : fdrswithnas} - outdf = pd.DataFrame.from_dict(outd) - - #format output columns - formats = {'minGcount': '{:d}', 'meanGcount': '{:.2f}', '{0}mutrate'.format(conds[0]): '{:.2e}', '{0}mutrate'.format(conds[1]): '{:.2e}', 'log2fc': '{:.2f}', 'p': '{:.3f}', 'FDR': '{:.3f}'} - for col, f in formats.items(): - outdf[col] = outdf[col].map(lambda x : f.format(x)) - - outdf.to_csv('glm.txt', sep = '\t', header = True, index = False, na_rep = 'NA') - - -#Usage: glm.py -#e.g. 
glm.py output1.txt,output2.txt,output3.txt,output4.txt S1R1,S1R2,S2R1,S2R2, S1,S1,S2,S2 - -bigdf = combinesamples(sys.argv[1], sys.argv[2]) -sampconds = classifysamples(sys.argv[2], sys.argv[3]) -doglm(bigdf, sampconds) From f7cbf774bad4155d39e6578e35459e814ed52889 Mon Sep 17 00:00:00 2001 From: Matthew Taliaferro Date: Tue, 10 May 2022 15:35:04 -0600 Subject: [PATCH 009/108] update readme for bacon --- README.md | 35 +++++++++++++++++++++++++++++++++-- 1 file changed, 33 insertions(+), 2 deletions(-) diff --git a/README.md b/README.md index ccf9699..5d696b1 100644 --- a/README.md +++ b/README.md @@ -37,6 +37,14 @@ PIGPEN has the following prerequisites: - bamtools >= 2.5.1 - bedtools >= 2.30.0 +BACON has the following prerequisites: + +- python >= 3.6 +- statsmodels >= 0.13.2 +- numpy >= 1.21 +- rpy2 >= 3.4.5 +- R >= 4.1 + ## Installation For now, installation can be done by cloning this repository. As PIPGEN matures, we will work towards getting this package on [bioconda](https://bioconda.github.io/). @@ -47,7 +55,7 @@ For now, installation can be done by cloning this repository. As PIPGEN matures, PIGPEN performs this by using [varscan](http://varscan.sourceforge.net/using-varscan.html) to find SNP positions. These locations are then excluded from all future analyses. Varscan parameters are controled by the PIGPEN parameters `--SNPcoverage` and `--SNPfreq` that control the depth and frequency required to call a SNP. We recommend being aggressive with these parameters. We often set them to 20 and 0.02, respectively. -PIGPEN performs this SNP calling on control alignment files (`--controlBams`) in which the intended oxidation did not occur. PIGPEN will use the union of all SNPs found in these files for masking. Whether or not to call SNPs at all (you definitely should) is controlled by `--useSNPs`. +PIGPEN performs this SNP calling on control alignment files (`--controlBams`) in which the intended oxidation did not occur. 
PIGPEN will use the union of all SNPs found in these files for masking. Whether or not to call SNPs at all (you probably should) is controlled by `--useSNPs`. This process can be time consuming. At the end, a file called **merged.vcf** is created in the current working directory. If this file is present, PIGPEN will assume that it should be used for SNP masking, allowing the process of identifying SNPs to be skipped. @@ -71,6 +79,29 @@ For now, PIGPEN using `bedtools` and a supplied bed file of gene locations (`--g After identifying the conversions present in each read and the cognate gene for each read, the number of conversions for each gene is calculated. We have observed that the overall rate of conversions (not just G -> T + G -> C, but all conversions) can vary signficantly from sample to sample, presumably due to a technical effect in library preparation. For this reason, PIGPEN calculates **PORC** (Proportion of Relevant Conversions) values. This is the log2 ratio of the relevant conversion rate ([G -> T + G -> C] / total number of reference G encountered) to the overall conversion rate (total number of all conversions / total number of positions interrogated). PORC therefore normalizes to the overall rate of conversions, removing this technical effect. +PIGPEN can use G -> T conversions, G -> C conversions, or both when calculating PORC values. This behavior is controlled by supplying the options `--use_g_t` and `--use_g_c`. + ## Statistical framework for comparing gene-level PORC values across conditions -We are working on this. +We could simply compare PORC values across conditions, but with that approach we lose information about the number of counts (conversions) that went into the PORC calculation. + +For each gene, PIGPEN calculates the number of relevant conversions (G -> T + G -> C) as well as all other conversions encountered. 
Each gene therefore ends up with a 2x2 contingency table of the following form:
+
+|                  | converted | not converted |
+| ---------------- | --------- | ------------- |
+| G positions      | a         | b             |
+| non-G positions  | c         | d             |
+
+Here, a counts Gs read as T or C, b counts Gs read as G, c counts all other observed conversions, and d counts correctly read non-G positions.
+
+We then want to compare groups (replicates) of contingency tables across conditions. BACON (Bioinformatic Analysis of the Conversion of Nucleotides) performs this comparison using a binomial linear mixed-effects model. Replicates are modeled as random effects.
+
+`full model = conversions ~ nucleotide + condition + nucleotide:condition + (1 | replicate)`
+
+`null model = conversions ~ nucleotide + condition + (1 | replicate)`
+
+The two models are then compared using a likelihood ratio test.
+
+As input, BACON takes a tab-delimited, headered file of the following form with one row per sample:
+
+| file | sample | condition |
+| -----|--------|-----------|
+| /path/to/pigpen/output | sample name | condition ID|

From 219d38658d55ca2a64d0d690fce572351ba28615 Mon Sep 17 00:00:00 2001
From: Matthew Taliaferro
Date: Thu, 12 May 2022 09:52:06 -0600
Subject: [PATCH 010/108] change multiconv option to nconv integer

---
 getmismatches.py | 56 ++++++++++++++++++++++++------------------------
 pigpen.py        |  6 +++---
 2 files changed, 31 insertions(+), 31 deletions(-)

diff --git a/getmismatches.py b/getmismatches.py
index f7cd595..974ddc4 100644
--- a/getmismatches.py
+++ b/getmismatches.py
@@ -176,7 +176,7 @@ def findsnps(controlbams, genomefasta, minCoverage = 20, minVarFreq = 0.02):
 
     return snps
 
-def iteratereads_pairedend(bam, onlyConsiderOverlap, use_g_t, use_g_c, snps=None, requireMultipleConv=False, verbosity='high'):
+def iteratereads_pairedend(bam, onlyConsiderOverlap, use_g_t, use_g_c, nConv, snps=None, verbosity='high'):
     #Iterate over reads in a paired end alignment file.
     #Find nt conversion locations for each read.
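The likelihood ratio test described in the README change above can be sketched numerically. The log-likelihood values below are made up for illustration; in BACON they come from the fitted full and null lme4 models:

```python
from scipy.stats import chi2

# Hypothetical log-likelihoods for the full and null mixed models
llf_full = -120.0
llf_null = -125.5

lr = 2 * (llf_full - llf_null)  # likelihood ratio statistic
# Degrees of freedom = number of extra parameters in the full model
# (one here, for the nucleotide:condition interaction with 2-level factors)
p = chi2.sf(lr, df=1)
print(lr)  # 11.0
```

A small p value means the interaction term (conversion rate depending on condition specifically at G positions) adds real explanatory power for that gene.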
#For locations interrogated by both mates of read pair, conversion must exist in both mates in order to count @@ -242,7 +242,7 @@ def iteratereads_pairedend(bam, onlyConsiderOverlap, use_g_t, use_g_c, snps=None read1qualities = list(read1.query_qualities) #phred scores read2qualities = list(read2.query_qualities) - convs_in_read = getmismatches_pairedend(read1alignedpairs, read2alignedpairs, read1queryseq, read2queryseq, read1qualities, read2qualities, read1strand, read2strand, snplocations, onlyConsiderOverlap, requireMultipleConv, use_g_t, use_g_c) + convs_in_read = getmismatches_pairedend(read1alignedpairs, read2alignedpairs, read1queryseq, read2queryseq, read1qualities, read2qualities, read1strand, read2strand, snplocations, onlyConsiderOverlap, nConv, use_g_t, use_g_c) queriednts.append(sum(convs_in_read.values())) convs[queryname] = convs_in_read @@ -254,7 +254,7 @@ def iteratereads_pairedend(bam, onlyConsiderOverlap, use_g_t, use_g_c, snps=None return convs, readcounter -def getmismatches_pairedend(read1alignedpairs, read2alignedpairs, read1queryseq, read2queryseq, read1qualities, read2qualities, read1strand, read2strand, snplocations, onlyoverlap, requireMultipleConv, use_g_t, use_g_c): +def getmismatches_pairedend(read1alignedpairs, read2alignedpairs, read1queryseq, read2queryseq, read1qualities, read2qualities, read1strand, read2strand, snplocations, onlyoverlap, nConv, use_g_t, use_g_c): #remove tuples that have None #These are either intronic or might have been soft-clipped #Tuples are (querypos, refpos, refsequence) @@ -412,29 +412,28 @@ def getmismatches_pairedend(read1alignedpairs, read2alignedpairs, read1queryseq, else: pass - #if we are requiring there be multiple g_t or g_c, implement that here - if requireMultipleConv: - if use_g_t and use_g_c: - if convs['g_t'] + convs['g_c'] >= 2: - pass - elif convs['g_t'] + convs['g_c'] < 2: - convs['g_t'] = 0 - convs['g_c'] = 0 - elif use_g_t and not use_g_c: - if convs['g_t'] >= 2: - pass - elif 
convs['g_t'] < 2: - convs['g_t'] = 0 - convs['g_c'] = 0 - elif use_g_c and not use_g_t: - if convs['g_c'] >= 2: - pass - elif convs['g_c'] < 2: - convs['g_c'] = 0 - convs['g_t'] = 0 - elif not use_g_t and not use_g_c: - print('ERROR: we have to be looking for at least either G->T or G->C if not both!!') - sys.exit() + #Does the number of g_t and/or g_c conversions meet our threshold? + if use_g_t and use_g_c: + if convs['g_t'] + convs['g_c'] >= nConv: + pass + elif convs['g_t'] + convs['g_c'] < nConv: + convs['g_t'] = 0 + convs['g_c'] = 0 + elif use_g_t and not use_g_c: + if convs['g_t'] >= nConv: + pass + elif convs['g_t'] < nConv: + convs['g_t'] = 0 + convs['g_c'] = 0 + elif use_g_c and not use_g_t: + if convs['g_c'] >= nConv: + pass + elif convs['g_c'] < nConv: + convs['g_c'] = 0 + convs['g_t'] = 0 + elif not use_g_t and not use_g_c: + print('ERROR: we have to be looking for at least either G->T or G->C if not both!!') + sys.exit() return convs @@ -551,7 +550,7 @@ def split_bam(bam, nproc): return splitbams -def getmismatches(bam, onlyConsiderOverlap, snps, requireMultipleConv, nproc, use_g_t, use_g_c): +def getmismatches(bam, onlyConsiderOverlap, snps, nConv, nproc, use_g_t, use_g_c): #Actually run the mismatch code (calling iteratereads_pairedend) #use multiprocessing #If there's only one processor, easier to use iteratereads_pairedend() directly. 
@@ -561,7 +560,8 @@ def getmismatches(bam, onlyConsiderOverlap, snps, requireMultipleConv, nproc, us splitbams = split_bam(bam, int(nproc)) argslist = [] for x in splitbams: - argslist.append((x, bool(onlyConsiderOverlap), bool(use_g_t), bool(use_g_c), snps, bool(requireMultipleConv), 'low')) + argslist.append((x, bool(onlyConsiderOverlap), bool( + use_g_t), bool(use_g_c), nConv, snps, 'low')) #items returned from iteratereads_pairedend are in a list, one per process totalreadcounter = 0 #number of reads across all the split bams diff --git a/pigpen.py b/pigpen.py index 8462c2f..8b7fb0b 100644 --- a/pigpen.py +++ b/pigpen.py @@ -27,7 +27,7 @@ parser.add_argument('--onlyConsiderOverlap', action = 'store_true', help = 'Only consider conversions seen in both reads of a read pair?') parser.add_argument('--use_g_t', action = 'store_true', help = 'Consider G->T conversions?') parser.add_argument('--use_g_c', action = 'store_true', help = 'Consider G->C conversions?') - parser.add_argument('--requireMultipleConv', action = 'store_true', help = 'Only consider conversions seen in reads with multiple G->C + G->T conversions?') + parser.add_argument('--nConv', type = int, help = 'Minimum number of required G->T and/or G->C conversions in a read pair in order for conversions to be counted. 
Default is 1.', default = 1) args = parser.parse_args() #We have to be either looking for G->T or G->C, if not both @@ -72,9 +72,9 @@ #Identify conversions if args.nproc == 1: - convs, readcounter = iteratereads_pairedend(filteredbam, args.onlyConsiderOverlap, args.use_g_t, args.use_g_c, snps, args.requireMultipleConv, 'high') + convs, readcounter = iteratereads_pairedend(filteredbam, args.onlyConsiderOverlap, args.use_g_t, args.use_g_c, snps, args.nConv, 'high') elif args.nproc > 1: - convs = getmismatches(filteredbam, args.onlyConsiderOverlap, snps, args.requireMultipleConv, args.nproc, args.use_g_t, args.use_g_c) + convs = getmismatches(filteredbam, args.onlyConsiderOverlap, snps, args.nConv, args.nproc, args.use_g_t, args.use_g_c) #Assign reads to genes From 17816dfa285a521b9c46566e761a3c365080ccb7 Mon Sep 17 00:00:00 2001 From: Matthew Taliaferro Date: Fri, 13 May 2022 09:44:39 -0600 Subject: [PATCH 011/108] fix bug in quality score reading --- getmismatches.py | 70 +++++++++++++++++++++++++++--------------------- 1 file changed, 40 insertions(+), 30 deletions(-) diff --git a/getmismatches.py b/getmismatches.py index 974ddc4..7cd3714 100644 --- a/getmismatches.py +++ b/getmismatches.py @@ -260,8 +260,27 @@ def getmismatches_pairedend(read1alignedpairs, read2alignedpairs, read1queryseq, #Tuples are (querypos, refpos, refsequence) #If there is a substitution, refsequence is lower case - #In quantseq-fwd, r1 is always sense strand - + #remove positions where querypos is None + #i'm pretty sure these query positions won't have quality scores + read1alignedpairs = [x for x in read1alignedpairs if x[0] != None] + read2alignedpairs = [x for x in read2alignedpairs if x[0] != None] + + #Add quality scores to alignedpairs tuples + #will now be (querypos, refpos, refsequence, qualityscore) + read1ap_withq = [] + for ind, x in enumerate(read1alignedpairs): + x += (read1qualities[ind],) + read1ap_withq.append(x) + read1alignedpairs = read1ap_withq + + read2ap_withq = [] + 
for ind, x in enumerate(read2alignedpairs): + x += (read2qualities[ind],) + read2ap_withq.append(x) + read2alignedpairs = read1ap_withq + + #Now remove positions where refsequence is None + #These may be places that got soft-clipped read1alignedpairs = [x for x in read1alignedpairs if None not in x] read2alignedpairs = [x for x in read2alignedpairs if None not in x] @@ -287,35 +306,39 @@ def getmismatches_pairedend(read1alignedpairs, read2alignedpairs, read1queryseq, #These locations (as defined by their reference positions) would be found both in read1alignedpairs and read2alignedpairs #Get the ref positions queried by the two reads - r1dict = {} #{reference position : [queryposition, reference sequence]} + r1dict = {} #{reference position : [queryposition, reference sequence, quality]} r2dict = {} for x in read1alignedpairs: - r1dict[int(x[1])] = [x[0], x[2]] + r1dict[int(x[1])] = [x[0], x[2], x[3]] for x in read2alignedpairs: - r2dict[int(x[1])] = [x[0], x[2]] + r2dict[int(x[1])] = [x[0], x[2], x[3]] - mergedalignedpairs = {} # {refpos : [R1querypos, R2querypos, R1refsequence, R2refsequence]} + mergedalignedpairs = {} # {refpos : [R1querypos, R2querypos, R1refsequence, R2refsequence, R1quality, R2quality]} #For positions only in R1 or R2, querypos and refsequence are NA for the other read for refpos in r1dict: r1querypos = r1dict[refpos][0] r1refseq = r1dict[refpos][1] + r1quality = r1dict[refpos][2] if refpos in mergedalignedpairs: #this should not be possible because we are looking at r1 first r2querypos = mergedalignedpairs[refpos][1] r2refseq = mergedalignedpairs[refpos][3] - mergedalignedpairs[refpos] = [r1querypos, r2querypos, r1refseq, r2refseq] + r2quality = mergedalignedpairs[refpos][5] + mergedalignedpairs[refpos] = [r1querypos, r2querypos, r1refseq, r2refseq, r1quality, r2quality] else: - mergedalignedpairs[refpos] = [r1querypos, 'NA', r1refseq, 'NA'] + mergedalignedpairs[refpos] = [r1querypos, 'NA', r1refseq, 'NA', r1quality, 'NA'] for refpos in 
r2dict: #same thing r2querypos = r2dict[refpos][0] r2refseq = r2dict[refpos][1] + r2quality = r2dict[refpos][2] if refpos in mergedalignedpairs: #if we saw it for r1 r1querypos = mergedalignedpairs[refpos][0] r1refseq = mergedalignedpairs[refpos][2] - mergedalignedpairs[refpos] = [r1querypos, r2querypos, r1refseq, r2refseq] + r1quality = mergedalignedpairs[refpos][4] + mergedalignedpairs[refpos] = [r1querypos, r2querypos, r1refseq, r2refseq, r1quality, r2quality] else: - mergedalignedpairs[refpos] = ['NA', r2querypos, 'NA', r2refseq] + mergedalignedpairs[refpos] = ['NA', r2querypos, 'NA', r2refseq, 'NA', r2quality] #Now go through mergedalignedpairs, looking for conversions. @@ -327,6 +350,8 @@ def getmismatches_pairedend(read1alignedpairs, read2alignedpairs, read1queryseq, r2querypos = mergedalignedpairs[refpos][1] r1refseq = mergedalignedpairs[refpos][2] r2refseq = mergedalignedpairs[refpos][3] + r1quality = mergedalignedpairs[refpos][4] + r2quality = mergedalignedpairs[refpos][5] if r1querypos != 'NA' and r2querypos == 'NA': #this position queried by r1 only if read1strand == '-': @@ -345,7 +370,7 @@ def getmismatches_pairedend(read1alignedpairs, read2alignedpairs, read1queryseq, querynt = revcomp(querynt) conv = r1refseq.lower() + '_' + querynt.lower() - if read1qualities[r1querypos] >= 30 and onlyoverlap == False: + if r1quality >= 30 and onlyoverlap == False: convs[conv] +=1 else: pass @@ -369,7 +394,7 @@ def getmismatches_pairedend(read1alignedpairs, read2alignedpairs, read1queryseq, querynt = revcomp(querynt) conv = r2refseq.lower() + '_' + querynt.lower() - if read2qualities[r2querypos] >= 30 and onlyoverlap == False: + if r2quality >= 30 and onlyoverlap == False: convs[conv] +=1 else: pass @@ -382,8 +407,8 @@ def getmismatches_pairedend(read1alignedpairs, read2alignedpairs, read1queryseq, if r1refseq == 'N' or r2refseq == 'N' or r1refseq == 'n' or r2refseq == 'n': continue - #If the position is not high quality in either r1 or r2, skip it - if 
read1qualities[r1querypos] < 30 and read2qualities[r2querypos] < 30: + #If the position is not high quality in both r1 and r2, skip it + if r1quality < 30 and r2quality < 30: continue if r1refseq.isupper() and r2refseq.isupper(): #both reads agree it is not a conversion @@ -591,21 +616,6 @@ def getmismatches(bam, onlyConsiderOverlap, snps, nConv, nproc, use_g_t, use_g_c if __name__ == '__main__': - snps = findsnps('trash', None, 20, 0.02) #control bams (comma separated), genomefasta - - #convs = iteratereads_singleend(sys.argv[1], None) - convs = iteratereads_pairedend(sys.argv[1], True, snps, False) - #with open('OINC3.mDBF.subsampled.filtered.convs.pkl', 'wb') as outfh: - #pickle.dump(convs, outfh) - #summarize_convs(convs, sys.argv[2]) - overlaps, numpairs = getReadOverlaps(sys.argv[1], sys.argv[2], sys.argv[3]) #bam, geneBed, chrom.sizes - read2gene = processOverlaps(overlaps, numpairs) - with open('read2gene.pkl', 'wb') as outfh: - pickle.dump(read2gene, outfh) - with open('convs.pkl', 'wb') as outfh: - pickle.dump(convs, outfh) - - numreadspergene, convsPerGene = getPerGene(convs, read2gene) - writeConvsPerGene(numreadspergene, convsPerGene, sys.argv[4]) #output + iteratereads_pairedend(sys.argv[1], False, True, True, 1, None, 'high') From 56cb13a0a9ed71f505efabb86dc9e01634818fcb Mon Sep 17 00:00:00 2001 From: Matthew Taliaferro Date: Mon, 23 May 2022 15:23:42 -0600 Subject: [PATCH 012/108] add script for parsing output of bam-readcount --- parsebamreadcount.py | 56 ++++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 56 insertions(+) create mode 100644 parsebamreadcount.py diff --git a/parsebamreadcount.py b/parsebamreadcount.py new file mode 100644 index 0000000..b5cbbdb --- /dev/null +++ b/parsebamreadcount.py @@ -0,0 +1,56 @@ +import sys +import os +import argparse + +#Take the output of bam-readcount and parse to get a table of nt fractions at each position + +def parsebrc(bamreadcountout, covthresh, outfile): + d = {} #{chrm : {position : [ref, 
depth, Afrac, Tfrac, Gfrac, Cfrac]}} + + with open(bamreadcountout, 'r') as infh: + for line in infh: + line = line.strip().split('\t') + chrm = line[0] + position = int(line[1]) + ref = line[2] + depth = int(line[3]) + if depth < covthresh: + continue + + if chrm not in d: + d[chrm] = {} + d[chrm][position] = [ref, depth] + ntfracs = {} #{nt : frac of reads} + for f in line[4:]: + fsplit = f.split(':') + nt = fsplit[0] + if nt not in ['A', 'C', 'T', 'G']: + continue + ntcount = int(fsplit[1]) + ntfrac = ntcount / depth + ntfrac = f'{ntfrac:.2e}' + ntfrac = float(ntfrac) + if nt == ref: + ntfrac = 'NA' + ntfracs[nt] = ntfrac + + d[chrm][position].extend([ntfracs['A'], ntfracs['T'], ntfracs['G'], ntfracs['C']]) + + with open(outfile, 'w') as outfh: + outfh.write(('\t').join(['chrm', 'position', 'ref', 'depth', 'Afrac', 'Tfrac', 'Gfrac', 'Cfrac']) + '\n') + for chrm in d: + for position in d[chrm]: + ref, depth, afrac, tfrac, gfrac, cfrac = d[chrm][position] + outfh.write(('\t').join([chrm, str(position), ref, str(depth), str(afrac), str(tfrac), str(gfrac), str(cfrac)]) + '\n') + + +if __name__ == '__main__': + parser = argparse.ArgumentParser() + parser.add_argument('--bamreadcountout', type = str, help = 'Output of bam-readcount.') + parser.add_argument('--mindepth', type = int, help = 'Minimum depth for a position to be considered.') + parser.add_argument('--outfile', type = str, help = 'Output file.') + args = parser.parse_args() + + parsebrc(args.bamreadcountout, args.mindepth, args.outfile) + + \ No newline at end of file From ca058e7817f66410e870ab5843d5b6dd0dacecf4 Mon Sep 17 00:00:00 2001 From: Matthew Taliaferro Date: Tue, 24 May 2022 14:31:14 -0600 Subject: [PATCH 013/108] add ability to manually mask positions --- getmismatches.py | 41 ++++++++++++++++++++++++++++++----------- maskpositions.py | 32 ++++++++++++++++++++++++++++++++ pigpen.py | 13 +++++++++++-- 3 files changed, 73 insertions(+), 13 deletions(-) create mode 100644 maskpositions.py diff --git 
a/getmismatches.py b/getmismatches.py index 7cd3714..2e06715 100644 --- a/getmismatches.py +++ b/getmismatches.py @@ -176,7 +176,7 @@ def findsnps(controlbams, genomefasta, minCoverage = 20, minVarFreq = 0.02): return snps -def iteratereads_pairedend(bam, onlyConsiderOverlap, use_g_t, use_g_c, nConv, snps=None, verbosity='high'): +def iteratereads_pairedend(bam, onlyConsiderOverlap, use_g_t, use_g_c, nConv, snps=None, maskpositions=None, verbosity='high'): #Iterate over reads in a paired end alignment file. #Find nt conversion locations for each read. #For locations interrogated by both mates of read pair, conversion must exist in both mates in order to count @@ -216,15 +216,34 @@ def iteratereads_pairedend(bam, onlyConsiderOverlap, use_g_t, use_g_c, nConv, sn queryname = read1.query_name chrm = read1.reference_name + #Get a set of positions to mask (snps + locations we want to mask) #Get a set of snp locations if we have them if snps: if chrm in snps: - snplocations = snps[chrm] #set of coordinates to mask + snplocations = snps[chrm] #set of snp coordinates to mask else: snplocations = None else: snplocations = None + #Get a set of locations to mask if we have them + if maskpositions: + if chrm in maskpositions: + masklocations = maskpositions[chrm] #set of coordinates to manually mask + else: + masklocations = None + else: + masklocations = None + + #combine snps and manually masked positions into one set + #this combined set will be masklocations + if snplocations and masklocations: + masklocations.update(snplocations) + elif snplocations and not masklocations: + masklocations = snplocations + elif masklocations and not snplocations: + masklocations = masklocations + read1queryseq = read1.query_sequence read1alignedpairs = read1.get_aligned_pairs(with_seq = True) if read1.is_reverse: @@ -242,7 +261,7 @@ def iteratereads_pairedend(bam, onlyConsiderOverlap, use_g_t, use_g_c, nConv, sn read1qualities = list(read1.query_qualities) #phred scores read2qualities = 
list(read2.query_qualities) - convs_in_read = getmismatches_pairedend(read1alignedpairs, read2alignedpairs, read1queryseq, read2queryseq, read1qualities, read2qualities, read1strand, read2strand, snplocations, onlyConsiderOverlap, nConv, use_g_t, use_g_c) + convs_in_read = getmismatches_pairedend(read1alignedpairs, read2alignedpairs, read1queryseq, read2queryseq, read1qualities, read2qualities, read1strand, read2strand, masklocations, onlyConsiderOverlap, nConv, use_g_t, use_g_c) queriednts.append(sum(convs_in_read.values())) convs[queryname] = convs_in_read @@ -254,7 +273,7 @@ def iteratereads_pairedend(bam, onlyConsiderOverlap, use_g_t, use_g_c, nConv, sn return convs, readcounter -def getmismatches_pairedend(read1alignedpairs, read2alignedpairs, read1queryseq, read2queryseq, read1qualities, read2qualities, read1strand, read2strand, snplocations, onlyoverlap, nConv, use_g_t, use_g_c): +def getmismatches_pairedend(read1alignedpairs, read2alignedpairs, read1queryseq, read2queryseq, read1qualities, read2qualities, read1strand, read2strand, masklocations, onlyoverlap, nConv, use_g_t, use_g_c): #remove tuples that have None #These are either intronic or might have been soft-clipped #Tuples are (querypos, refpos, refsequence) @@ -284,11 +303,11 @@ def getmismatches_pairedend(read1alignedpairs, read2alignedpairs, read1queryseq, read1alignedpairs = [x for x in read1alignedpairs if None not in x] read2alignedpairs = [x for x in read2alignedpairs if None not in x] - #if we have snps, remove their locations from read1alignedpairs and read2alignedpairs - #snplocations is a set of 0-based coordinates of snp locations to mask - if snplocations: - read1alignedpairs = [x for x in read1alignedpairs if x[1] not in snplocations] - read2alignedpairs = [x for x in read2alignedpairs if x[1] not in snplocations] + #if we have locations to mask, remove their locations from read1alignedpairs and read2alignedpairs + #masklocations is a set of 0-based coordinates of snp locations to mask + 
if masklocations: + read1alignedpairs = [x for x in read1alignedpairs if x[1] not in masklocations] + read2alignedpairs = [x for x in read2alignedpairs if x[1] not in masklocations] convs = {} #counts of conversions x_y where x is reference sequence and y is query sequence @@ -575,7 +594,7 @@ def split_bam(bam, nproc): return splitbams -def getmismatches(bam, onlyConsiderOverlap, snps, nConv, nproc, use_g_t, use_g_c): +def getmismatches(bam, onlyConsiderOverlap, snps, maskpositions, nConv, nproc, use_g_t, use_g_c): #Actually run the mismatch code (calling iteratereads_pairedend) #use multiprocessing #If there's only one processor, easier to use iteratereads_pairedend() directly. @@ -586,7 +605,7 @@ def getmismatches(bam, onlyConsiderOverlap, snps, nConv, nproc, use_g_t, use_g_c argslist = [] for x in splitbams: argslist.append((x, bool(onlyConsiderOverlap), bool( - use_g_t), bool(use_g_c), nConv, snps, 'low')) + use_g_t), bool(use_g_c), nConv, snps, maskpositions, 'low')) #items returned from iteratereads_pairedend are in a list, one per process totalreadcounter = 0 #number of reads across all the split bams diff --git a/maskpositions.py b/maskpositions.py new file mode 100644 index 0000000..f787fff --- /dev/null +++ b/maskpositions.py @@ -0,0 +1,32 @@ +#Given a bed of positions to mask, extract them and store them in a dictionary +#so they can be given to getmismatches.py. +#Beds are 0-based half-open, and we want to give getmismatches 0-based coordinates for easy +#interfacing with pysam. 
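[Editor's note] Bed intervals are 0-based and half-open, so a record spanning 10-13 masks positions 10, 11, and 12 but not 13. A quick illustration of the `range(start, end)` expansion the new module relies on:

```python
# bed record "chr1  10  13": 0-based, half-open interval
start, end = 10, 13
masked = set(range(start, end))  # expands to the covered coordinates
assert masked == {10, 11, 12}
assert end not in masked  # the end coordinate itself is excluded
```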
+ +def readmaskbed(maskbed): + maskdict = {} #{chrm : (set of positions to mask)} + with open(maskbed, 'r') as infh: + for line in infh: + line = line.strip().split('\t') + chrm = line[0] + start = int(line[1]) + end = int(line[2]) + interval = list(range(start, end)) + if chrm not in maskdict: + maskdict[chrm] = [] + maskdict[chrm].extend(interval) + + + #Remove duplicates + for chrm in maskdict: + coords = set(maskdict[chrm]) + maskdict[chrm] = coords + + #Tell us how many positions we are masking + totalmask = 0 + for chrm in maskdict: + totalmask += len(maskdict[chrm]) + + print('Manually masking {0} positions.'.format(totalmask)) + + return maskdict \ No newline at end of file diff --git a/pigpen.py b/pigpen.py index 8b7fb0b..a252028 100644 --- a/pigpen.py +++ b/pigpen.py @@ -6,6 +6,7 @@ import os import sys from snps import getSNPs, recordSNPs +from maskpositions import readmaskbed from filterbam import intersectreads, filterbam, intersectreads_multiprocess from getmismatches import iteratereads_pairedend, getmismatches from assignreads import getReadOverlaps, processOverlaps @@ -22,6 +23,7 @@ parser.add_argument('--output', type = str, help = 'Output file of conversion rates for each gene.') parser.add_argument('--nproc', type = int, help = 'Number of processors to use. Default is 1.', default = 1) parser.add_argument('--useSNPs', action = 'store_true', help = 'Consider SNPs?') + parser.add_argument('--maskbed', help = 'Optional. Bed file of positions to mask from analysis.', default = None) parser.add_argument('--SNPcoverage', type = int, help = 'Minimum coverage to call SNPs. Default = 20', default = 20) parser.add_argument('--SNPfreq', type = float, help = 'Minimum variant frequency to call SNPs. 
Default = 0.02', default = 0.02) parser.add_argument('--onlyConsiderOverlap', action = 'store_true', help = 'Only consider conversions seen in both reads of a read pair?') @@ -60,6 +62,13 @@ elif not args.useSNPs: snps = None + #Get positions to manually mask if given + if args.maskbed: + print('Getting positions to manually mask...') + maskpositions = readmaskbed(args.maskbed) + elif not args.maskbed: + maskpositions = None + #Filter bam for reads contained within entries in geneBed #This will reduce the amount of time it takes to find conversions print('Filtering bam for reads contained within regions of interest...') @@ -72,9 +81,9 @@ #Identify conversions if args.nproc == 1: - convs, readcounter = iteratereads_pairedend(filteredbam, args.onlyConsiderOverlap, args.use_g_t, args.use_g_c, snps, args.nConv, 'high') + convs, readcounter = iteratereads_pairedend(filteredbam, args.onlyConsiderOverlap, args.use_g_t, args.use_g_c, args.nConv, snps, maskpositions, 'high') elif args.nproc > 1: - convs = getmismatches(filteredbam, args.onlyConsiderOverlap, snps, args.nConv, args.nproc, args.use_g_t, args.use_g_c) + convs = getmismatches(filteredbam, args.onlyConsiderOverlap, snps, maskpositions, args.nConv, args.nproc, args.use_g_t, args.use_g_c) #Assign reads to genes From 09fe2163c15cc9e17f36f302095d5dbdcf3e9fa5 Mon Sep 17 00:00:00 2001 From: Matthew Taliaferro Date: Wed, 25 May 2022 10:37:01 -0600 Subject: [PATCH 014/108] fix qual score bug and add use read1 and/or read2 --- getmismatches.py | 33 ++++++++++++++++++++++++++------- pigpen.py | 16 ++++++++++++++-- 2 files changed, 40 insertions(+), 9 deletions(-) diff --git a/getmismatches.py b/getmismatches.py index 2e06715..ea044de 100644 --- a/getmismatches.py +++ b/getmismatches.py @@ -176,7 +176,7 @@ def findsnps(controlbams, genomefasta, minCoverage = 20, minVarFreq = 0.02): return snps -def iteratereads_pairedend(bam, onlyConsiderOverlap, use_g_t, use_g_c, nConv, snps=None, maskpositions=None, verbosity='high'): 
+def iteratereads_pairedend(bam, onlyConsiderOverlap, use_g_t, use_g_c, use_read1, use_read2, nConv, snps=None, maskpositions=None, verbosity='high'): #Iterate over reads in a paired end alignment file. #Find nt conversion locations for each read. #For locations interrogated by both mates of read pair, conversion must exist in both mates in order to count @@ -261,7 +261,7 @@ def iteratereads_pairedend(bam, onlyConsiderOverlap, use_g_t, use_g_c, nConv, sn read1qualities = list(read1.query_qualities) #phred scores read2qualities = list(read2.query_qualities) - convs_in_read = getmismatches_pairedend(read1alignedpairs, read2alignedpairs, read1queryseq, read2queryseq, read1qualities, read2qualities, read1strand, read2strand, masklocations, onlyConsiderOverlap, nConv, use_g_t, use_g_c) + convs_in_read = getmismatches_pairedend(read1alignedpairs, read2alignedpairs, read1queryseq, read2queryseq, read1qualities, read2qualities, read1strand, read2strand, masklocations, onlyConsiderOverlap, nConv, use_g_t, use_g_c, use_read1, use_read2) queriednts.append(sum(convs_in_read.values())) convs[queryname] = convs_in_read @@ -273,7 +273,7 @@ def iteratereads_pairedend(bam, onlyConsiderOverlap, use_g_t, use_g_c, nConv, sn return convs, readcounter -def getmismatches_pairedend(read1alignedpairs, read2alignedpairs, read1queryseq, read2queryseq, read1qualities, read2qualities, read1strand, read2strand, masklocations, onlyoverlap, nConv, use_g_t, use_g_c): +def getmismatches_pairedend(read1alignedpairs, read2alignedpairs, read1queryseq, read2queryseq, read1qualities, read2qualities, read1strand, read2strand, masklocations, onlyoverlap, nConv, use_g_t, use_g_c, use_read1, use_read2): #remove tuples that have None #These are either intronic or might have been soft-clipped #Tuples are (querypos, refpos, refsequence) @@ -296,7 +296,7 @@ def getmismatches_pairedend(read1alignedpairs, read2alignedpairs, read1queryseq, for ind, x in enumerate(read2alignedpairs): x += (read2qualities[ind],) 
read2ap_withq.append(x) - read2alignedpairs = read1ap_withq + read2alignedpairs = read2ap_withq #Now remove positions where refsequence is None #These may be places that got soft-clipped @@ -359,6 +359,21 @@ def getmismatches_pairedend(read1alignedpairs, read2alignedpairs, read1queryseq, else: mergedalignedpairs[refpos] = ['NA', r2querypos, 'NA', r2refseq, 'NA', r2quality] + #If we are only using read1 or only using read2, replace the positions in the non-used read with NA + for refpos in mergedalignedpairs: + r1querypos, r2querypos, r1refseq, r2refseq, r1quality, r2quality = mergedalignedpairs[refpos] + if use_read1 and not use_read2: + updatedlist = [r1querypos, 'NA', r1refseq, 'NA', r1quality, 'NA'] + mergedalignedpairs[refpos] = updatedlist + elif use_read2 and not use_read1: + updatedlist = ['NA', r2querypos, 'NA', r2refseq, 'NA', r2quality] + mergedalignedpairs[refpos] = updatedlist + elif not use_read1 and not use_read2: + print('ERROR: we have to use either read1 or read2, if not both.') + sys.exit() + elif use_read1 and use_read2: + pass + #Now go through mergedalignedpairs, looking for conversions. #For positions observed both in r1 and r2, a conversion must be present in both reads, @@ -456,6 +471,9 @@ def getmismatches_pairedend(read1alignedpairs, read2alignedpairs, read1queryseq, else: pass + elif r1querypos == 'NA' and r2querypos == 'NA': #if we are using only read1 or read2, it's possible for this position in both reads to be NA + continue + #Does the number of g_t and/or g_c conversions meet our threshold? 
if use_g_t and use_g_c: if convs['g_t'] + convs['g_c'] >= nConv: @@ -594,7 +612,7 @@ def split_bam(bam, nproc): return splitbams -def getmismatches(bam, onlyConsiderOverlap, snps, maskpositions, nConv, nproc, use_g_t, use_g_c): +def getmismatches(bam, onlyConsiderOverlap, snps, maskpositions, nConv, nproc, use_g_t, use_g_c, use_read1, use_read2): #Actually run the mismatch code (calling iteratereads_pairedend) #use multiprocessing #If there's only one processor, easier to use iteratereads_pairedend() directly. @@ -605,7 +623,7 @@ def getmismatches(bam, onlyConsiderOverlap, snps, maskpositions, nConv, nproc, u argslist = [] for x in splitbams: argslist.append((x, bool(onlyConsiderOverlap), bool( - use_g_t), bool(use_g_c), nConv, snps, maskpositions, 'low')) + use_g_t), bool(use_g_c), bool(use_read1), bool(use_read2), nConv, snps, maskpositions, 'low')) #items returned from iteratereads_pairedend are in a list, one per process totalreadcounter = 0 #number of reads across all the split bams @@ -635,6 +653,7 @@ def getmismatches(bam, onlyConsiderOverlap, snps, maskpositions, nConv, nproc, u if __name__ == '__main__': - iteratereads_pairedend(sys.argv[1], False, True, True, 1, None, 'high') + convs, readcounter = iteratereads_pairedend(sys.argv[1], False, True, True, True, False, 1, None, None, 'high') + summarize_convs(convs, sys.argv[2]) diff --git a/pigpen.py b/pigpen.py index a252028..44fa85d 100644 --- a/pigpen.py +++ b/pigpen.py @@ -29,6 +29,8 @@ parser.add_argument('--onlyConsiderOverlap', action = 'store_true', help = 'Only consider conversions seen in both reads of a read pair?') parser.add_argument('--use_g_t', action = 'store_true', help = 'Consider G->T conversions?') parser.add_argument('--use_g_c', action = 'store_true', help = 'Consider G->C conversions?') + parser.add_argument('--use_read1', action = 'store_true', help = 'Use read1 when looking for conversions?') + parser.add_argument('--use_read2', action = 'store_true', help = 'Use read2 when looking 
for conversions?') parser.add_argument('--nConv', type = int, help = 'Minimum number of required G->T and/or G->C conversions in a read pair in order for conversions to be counted. Default is 1.', default = 1) args = parser.parse_args() @@ -37,6 +39,16 @@ print('We have to either be looking for G->T or G->C, if not both! Add argument --use_g_t and/or --use_g_c.') sys.exit() + #We have to be using either read1 or read2 if not both + if not args.use_read1 and not args.use_read2: + print('We need to use read1 or read2, if not both! Add argument --use_read1 and/or --use_read2.') + sys.exit() + + #If we want to only consider overlap, we have to be using both read1 and read2 + if not args.onlyConsiderOverlap and not args.use_read1 or not args.use_read2: + print('If we are only going to consider overlap between paired reads, we must use both read1 and read2.') + sys.exit() + #Make index for bam if there isn't one already bamindex = args.bam + '.bai' if not os.path.exists(bamindex): @@ -81,9 +93,9 @@ #Identify conversions if args.nproc == 1: - convs, readcounter = iteratereads_pairedend(filteredbam, args.onlyConsiderOverlap, args.use_g_t, args.use_g_c, args.nConv, snps, maskpositions, 'high') + convs, readcounter = iteratereads_pairedend(filteredbam, args.onlyConsiderOverlap, args.use_g_t, args.use_g_c, args.use_read1, args.use_read2, args.nConv, snps, maskpositions, 'high') elif args.nproc > 1: - convs = getmismatches(filteredbam, args.onlyConsiderOverlap, snps, maskpositions, args.nConv, args.nproc, args.use_g_t, args.use_g_c) + convs = getmismatches(filteredbam, args.onlyConsiderOverlap, snps, maskpositions, args.nConv, args.nproc, args.use_g_t, args.use_g_c, args.use_read1, args.use_read2) #Assign reads to genes From 27d58ae387a0a1bdf51df370d5169b1779ce3230 Mon Sep 17 00:00:00 2001 From: Matthew Taliaferro Date: Mon, 6 Jun 2022 16:03:53 -0600 Subject: [PATCH 015/108] fix read1/2 bug in pigpen argparse --- pigpen.py | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) 
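[Editor's note] The one-line fix in this patch is an operator-precedence bug: in Python, `and` binds more tightly than `or`, so the original condition grouped as `(not overlap and not read1) or not read2` and bailed out whenever read2 was unused, even with overlap mode off. A minimal standalone illustration:

```python
# read1-only analysis with overlap mode off
onlyConsiderOverlap, use_read1, use_read2 = False, True, False

# buggy grouping: "or not use_read2" is evaluated on its own
buggy = not onlyConsiderOverlap and not use_read1 or not use_read2
# fixed grouping: both mates are only required when overlap mode is on
fixed = onlyConsiderOverlap and (not use_read1 or not use_read2)

assert buggy is True   # would wrongly exit
assert fixed is False  # correctly allows read1-only runs
```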
diff --git a/pigpen.py b/pigpen.py index 44fa85d..dc8ed83 100644 --- a/pigpen.py +++ b/pigpen.py @@ -45,7 +45,7 @@ sys.exit() #If we want to only consider overlap, we have to be using both read1 and read2 - if not args.onlyConsiderOverlap and not args.use_read1 or not args.use_read2: + if args.onlyConsiderOverlap and (not args.use_read1 or not args.use_read2): print('If we are only going to consider overlap between paired reads, we must use both read1 and read2.') sys.exit() From 20e01e174bf54823d3b9f0d4b7500c87c48815c5 Mon Sep 17 00:00:00 2001 From: Matthew Taliaferro Date: Tue, 5 Jul 2022 16:11:58 -0600 Subject: [PATCH 016/108] add alignment and quantification script --- alignAndQuant.py | 172 +++++++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 172 insertions(+) create mode 100644 alignAndQuant.py diff --git a/alignAndQuant.py b/alignAndQuant.py new file mode 100644 index 0000000..7a5f9c4 --- /dev/null +++ b/alignAndQuant.py @@ -0,0 +1,172 @@ +import os +import subprocess +import sys +import shutil +import argparse + +#Given a pair of read files, align reads using STAR and quantify/align reads using salmon. +#This will make a STAR-produced bam (for pigpen mutation calling) and a salmon-produced bam (for read assignment). +#It will then run postmaster to append transcript assignments to the salmon-produced bam. + +#This is going to take in gzipped fastqs, a directory containing the STAR index for this genome, and a directory containing the salmon index for this genome. + +#Reads are aligned to the genome using STAR. This bam file will be used for mutation calling. Uniquely aligning reads from this alignment are then +#written to temporary fastq files (.unique.r1..fq.gz), which are then used for salmon and postmaster. 
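[Editor's note] alignAndQuant.py shells out to STAR, samtools, salmon, and postmaster, all of which must be discoverable on the user's PATH. A defensive preflight check could look like this (`missing_tools` is a hypothetical helper, not part of the script):

```python
import shutil

def missing_tools(tools=('STAR', 'salmon', 'samtools', 'postmaster')):
    # Return the required executables that cannot be found on PATH
    return [t for t in tools if shutil.which(t) is None]

# a caller could sys.exit() with a message if this returns a non-empty list
```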
+ +#When runSTAR(), bamtofastq(), runSalmon(), and runPostmaster() are run in succession, the output is a file called .postmaster.bam in the postmaster/ +#and Aligned.sortedByCoord.out.bam in STAR/ and .quant.sf in salmon/ + +#Requires STAR, salmon(>= 1.9.0), and postmaster be in user's PATH. + +def runSTAR(reads1, reads2, nthreads, STARindex, samplename): + if not os.path.exists('STAR'): + os.mkdir('STAR') + + cwd = os.getcwd() + outdir = os.path.join(cwd, 'STAR') + + #Clean output directory if it already exists + if os.path.exists(outdir) and os.path.isdir(outdir): + shutil.rmtree(outdir) + + os.mkdir(outdir) + prefix = outdir + '/' + samplename + + command = ['STAR', '--runMode', 'alignReads', '--runThreadN', nthreads, '--genomeLoad', 'NoSharedMemory', '--genomeDir', STARindex, '--readFilesIn', reads1, reads2, '--readFilesCommand', + 'zcat', '--outFileNamePrefix', prefix, '--outSAMtype', 'BAM', 'SortedByCoordinate', '--outSAMstrandField', 'intronMotif', '-–outFilterMultimapNmax', '1', '--outSAMattributes', 'MD', 'NH'] + + print('Running STAR for {0}...'.format(samplename)) + + subprocess.call(command) + + print('Finished STAR for {0}!'.format(samplename)) + + +def bamtofastq(samplename, nthreads): + #Given a bam file of uniquely aligned reads (produced from runSTAR), rederive these reads as fastq in preparation for submission to salmon + if not os.path.exists('STAR'): + os.mkdir('STAR') + + cwd = os.getcwd() + outdir = os.path.join(cwd, 'STAR') + inbam = os.path.join(outdir, samplename + 'Aligned.sortedByCoord.out.bam') + sortedbam = os.path.join(outdir, 'temp.namesort.bam') + + #First sort bam file by readname + print('Sorting bam file by read name...') + command = ['samtools', 'collate', '--threads', nthreads, '-u', '-o', sortedbam, inbam] + subprocess.call(command) + print('Done!') + + #Now derive fastq + r1file = samplename + '.unique.r1.fq.gz' + r2file = samplename + '.unique.r2.fq.gz' + print('Writing fastq file of uniquely aligned reads for 
{0}...'.format(samplename)) + command = ['samtools', 'fastq', '--threads', nthreads, '-1', r1file, '-2', r2file, '-0', '/dev/null', '-s', '/dev/null', '-n', sortedbam] + subprocess.call(command) + print('Done writing fastq files for {0}!'.format(samplename)) + + os.remove(sortedbam) + + +def runSalmon(reads1, reads2, nthreads, salmonindex, samplename): + #Take in those uniquely aligning reads and quantify transcript abundance with them using salmon. + + if not os.path.exists('salmon'): + os.mkdir('salmon') + + idx = os.path.abspath(salmonindex) + r1 = os.path.abspath(reads1) + r2 = os.path.abspath(reads2) + + os.chdir('salmon') + + command = ['salmon', 'quant', '--libType', 'A', '-p', nthreads, '--seqBias', '--gcBias', + '--validateMappings', '-1', r1, '-2', r2, '-o', samplename, '--index', idx, '--writeMappings={0}.salmon.bam'.format(samplename), '--writeQualities'] + + print('Running salmon for {0}...'.format(samplename)) + + subprocess.call(command) + + #Move output + outputdir = os.path.join(os.getcwd(), samplename) + quantfile = os.path.join(outputdir, 'quant.sf') + movedquantfile = os.path.join(os.getcwd(), '{0}.quant.sf'.format(samplename)) + os.rename(quantfile, movedquantfile) + + #Remove uniquely aligning read files + os.remove(r1) + os.remove(r2) + + print('Finished salmon for {0}!'.format(samplename)) + +def runPostmaster(samplename, nthreads): + if not os.path.exists('postmaster'): + os.mkdir('postmaster') + + salmonquant = os.path.join(os.getcwd(), 'salmon', '{0}.quant.sf'.format(samplename)) + salmonbam = os.path.join(os.getcwd(), 'salmon', '{0}.salmon.bam'.format(samplename)) + + os.chdir('postmaster') + outputfile = os.path.join(os.getcwd(), '{0}.postmaster.bam'.format(samplename)) + + print('Running postmaster for {0}...'.format(samplename)) + command = ['postmaster', '--num-threads', nthreads, '--quant', salmonquant, '--alignments', salmonbam, '--output', outputfile] + subprocess.call(command) + + #Sort and index bam + with open(outputfile + 
'.sort', 'w') as sortedfh: + command = ['samtools', 'sort', '-@', nthreads, outputfile] + subprocess.run(command, stdout = sortedfh) + + os.rename(outputfile + '.sort', outputfile) + + command = ['samtools', 'index', outputfile] + subprocess.call(command) + + print('Finished postmaster for {0}!'.format(samplename)) + +def addMD(samplename, reffasta, nthreads): + inputbam = os.path.join(os.getcwd(), 'postmaster', '{0}.postmaster.bam'.format(samplename)) + command = ['samtools', 'calmd', '-b', '--threads', nthreads, inputbam, reffasta] + + print('Adding MD tags to {0}.postmaster.md.bam...'.format(samplename)) + with open(samplename + '.postmaster.md.bam', 'w') as outfile: + subprocess.run(command, stdout = outfile) + print('Finished adding MD tags to {0}.postmaster.md.bam!'.format(samplename)) + +if __name__ == '__main__': + parser = argparse.ArgumentParser(description = 'Align and quantify reads using STAR, salmon, and postmaster in preparation for analysis with PIGPEN.') + parser.add_argument('--forwardreads', type = str, help = 'Forward reads. Gzipped fastq.') + parser.add_argument('--reversereads', type = str, help = 'Reverse reads. Gzipped fastq.') + parser.add_argument('--nthreads', type = str, help = 'Number of threads to use for alignment and quantification.') + parser.add_argument('--STARindex', type = str, help = 'STAR index directory.') + parser.add_argument('--salmonindex', type = str, help = 'Salmon index directory.') + parser.add_argument('--samplename', type = str, help = 'Sample name. 
Will be appended to output files.') + args = parser.parse_args() + + r1 = os.path.abspath(args.forwardreads) + r2 = os.path.abspath(args.reversereads) + STARindex = os.path.abspath(args.STARindex) + salmonindex = os.path.abspath(args.salmonindex) + samplename = args.samplename + nthreads = args.nthreads + + wd = os.path.abspath(os.getcwd()) + sampledir = os.path.join(wd, samplename) + if os.path.exists(sampledir) and os.path.isdir(sampledir): + shutil.rmtree(sampledir) + os.mkdir(sampledir) + os.chdir(sampledir) + + #uniquely aligning read files + uniquer1 = samplename + '.unique.r1.fq.gz' + uniquer2 = samplename + '.unique.r2.fq.gz' + + runSTAR(r1, r2, nthreads, STARindex, samplename) + bamtofastq(samplename, nthreads) + runSalmon(uniquer1, uniquer2, nthreads, salmonindex, samplename) + os.chdir(sampledir) + runPostmaster(samplename, nthreads) + + From 10e1ed82152395a57ef7dbbfdda1389305e0aaf2 Mon Sep 17 00:00:00 2001 From: Matthew Taliaferro Date: Tue, 5 Jul 2022 16:28:02 -0600 Subject: [PATCH 017/108] typos in alignAndQuant --- alignAndQuant.py | 6 ++++-- 1 file changed, 4 insertions(+), 2 deletions(-) diff --git a/alignAndQuant.py b/alignAndQuant.py index 7a5f9c4..e9da30f 100644 --- a/alignAndQuant.py +++ b/alignAndQuant.py @@ -13,8 +13,10 @@ #Reads are aligned to the genome using STAR. This bam file will be used for mutation calling. Uniquely aligning reads from this alignment are then #written to temporary fastq files (.unique.r1..fq.gz), which are then used for salmon and postmaster. -#When runSTAR(), bamtofastq(), runSalmon(), and runPostmaster() are run in succession, the output is a file called .postmaster.bam in the postmaster/ -#and Aligned.sortedByCoord.out.bam in STAR/ and .quant.sf in salmon/ +#When runSTAR(), bamtofastq(), runSalmon(), and runPostmaster() are run in succession, the output is a directory called . 
+#In this directory, the STAR output is Aligned.sortedByCoord.out.bam in STAR/, +#the salmon output is .quant.sf and .salmon.bam in salmon/, +#and the postmaster output is .postmaster.bam in postmaster/ #Requires STAR, salmon(>= 1.9.0), and postmaster be in user's PATH. From da59501e60b155476dc9e17b58d982df80098d73 Mon Sep 17 00:00:00 2001 From: Matthew Taliaferro Date: Tue, 5 Jul 2022 16:28:58 -0600 Subject: [PATCH 018/108] another stupid typo in alignAndQuant --- alignAndQuant.py | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/alignAndQuant.py b/alignAndQuant.py index e9da30f..5584680 100644 --- a/alignAndQuant.py +++ b/alignAndQuant.py @@ -14,7 +14,7 @@ #written to temporary fastq files (.unique.r1..fq.gz), which are then used for salmon and postmaster. #When runSTAR(), bamtofastq(), runSalmon(), and runPostmaster() are run in succession, the output is a directory called . -#In this directory, the STAR output is Aligned.sortedByCoord.out.bam in STAR/, +#In this directory, the STAR output is Aligned.sortedByCoord.out.bam in STAR/, #the salmon output is .quant.sf and .salmon.bam in salmon/, #and the postmaster output is .postmaster.bam in postmaster/ From 3a4477e756e8cc2040144600b8a8f5891165087c Mon Sep 17 00:00:00 2001 From: Matthew Taliaferro Date: Thu, 7 Jul 2022 10:37:56 -0600 Subject: [PATCH 019/108] update default snp variant freq --- snps.py | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/snps.py b/snps.py index 6374962..f17c3ff 100644 --- a/snps.py +++ b/snps.py @@ -13,7 +13,7 @@ def getSNPs(bams, genomefasta, minCoverage = 20, minVarFreq = 0.02): if not minCoverage: minCoverage = 20 if not minVarFreq: - minVarFreq = 0.02 + minVarFreq = 0.2 #if we already made a vcf, don't make another one if os.path.exists('merged.vcf'): From aed55d88b48bda59ea921ea0e86d8f4d13c8f479 Mon Sep 17 00:00:00 2001 From: Matthew Taliaferro Date: Thu, 7 Jul 2022 10:38:57 -0600 Subject: [PATCH 020/108] write conversions as pickled dictionary --- 
getmismatches.py | 9 ++++++--- 1 file changed, 6 insertions(+), 3 deletions(-) diff --git a/getmismatches.py b/getmismatches.py index ea044de..0e56cc4 100644 --- a/getmismatches.py +++ b/getmismatches.py @@ -269,7 +269,10 @@ def iteratereads_pairedend(bam, onlyConsiderOverlap, use_g_t, use_g_c, use_read1 print('Queried {0} read pairs in {1}.'.format(readcounter, os.path.basename(bam))) pysam.set_verbosity(save) - #Pickle and write convs? + #Pickle and write convs + with open('convs.pkl', 'wb') as outfh: + pickle.dump(convs, outfh) + return convs, readcounter @@ -653,7 +656,7 @@ def getmismatches(bam, onlyConsiderOverlap, snps, maskpositions, nConv, nproc, u if __name__ == '__main__': - convs, readcounter = iteratereads_pairedend(sys.argv[1], False, True, True, True, False, 1, None, None, 'high') - summarize_convs(convs, sys.argv[2]) + convs, readcounter = iteratereads_pairedend(sys.argv[1], False, True, True, True, True, 1, None, None, 'high') + #summarize_convs(convs, sys.argv[2]) From 71eb0edf132e2bc6a5a601a48296d66f13872b8f Mon Sep 17 00:00:00 2001 From: Matthew Taliaferro Date: Thu, 7 Jul 2022 14:05:21 -0600 Subject: [PATCH 021/108] index STAR bam after creating it --- alignAndQuant.py | 8 ++++++++ 1 file changed, 8 insertions(+) diff --git a/alignAndQuant.py b/alignAndQuant.py index 5584680..b25eca7 100644 --- a/alignAndQuant.py +++ b/alignAndQuant.py @@ -41,6 +41,14 @@ def runSTAR(reads1, reads2, nthreads, STARindex, samplename): subprocess.call(command) + #make index + bam = os.path.join(outdir, samplename + 'Aligned.sortedByCoord.out.bam') + bamindex = bam + '.bai' + if not os.path.exists(bamindex): + indexCMD = 'samtools index ' + bam + index = subprocess.Popen(indexCMD, shell=True) + index.wait() + print('Finished STAR for {0}!'.format(samplename)) From 5c900b51dc1dfbf14692c9f8112153134daee419 Mon Sep 17 00:00:00 2001 From: Matthew Taliaferro Date: Thu, 7 Jul 2022 15:16:35 -0600 Subject: [PATCH 022/108] add assignreads_salmon --- assignreads_salmon.py | 
258 ++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 258 insertions(+) create mode 100644 assignreads_salmon.py diff --git a/assignreads_salmon.py b/assignreads_salmon.py new file mode 100644 index 0000000..755243b --- /dev/null +++ b/assignreads_salmon.py @@ -0,0 +1,258 @@ +import os +import sys +import pysam +import pickle +import gffutils +import numpy as np + + +#Take in a dictionary of {readid : conversions} (made by getmismatches.py) and a postmaster-enhanced bam (made by alignAndQuant.py). +#First, construct dictionary of {readid : {txid : fractional assignment}}. Then, combining this dictionary with the previous one, +#count the number of conversions associated with each transcript. Finally (and I guess optionally), using a genome annotation file, +#collapse transcript level conversion counts to gene-level conversion counts. + +def getpostmasterassignments(postmasterbam): + #Given a postmaster-produced bam, make a dictionary of the form {readid : {txid : fractional assignment}} + #It looks like in a postmaster bam that paired end reads are right after each other and are always + #given the same fractional assignments. This means we can probably just consider R1 reads. + + pprobs = {} #{readid : {txid : pprob}} + + with pysam.AlignmentFile(postmasterbam, 'r') as bamfh: + for read in bamfh.fetch(until_eof = True): + if read.is_read2: + continue + readid = read.query_name + tx = read.reference_name + pprob = read.get_tag(tag='ZW') + if readid not in pprobs: + pprobs[readid] = {} + pprobs[readid][tx] = pprob + + return pprobs + +def assigntotxs(pprobs, convs): + #Intersect posterior probabilities of read assignments to transcripts with conversion counts of those reads. + #The counts assigned to a tx by a read are scaled by the posterior probability that a read came from that transcript. 
+ + #pprobs = #{readid : {txid : pprob}} + #produced from getpostmasterassignments() + #convs = #{readid : {a_a : 200, a_t : 1, etc.}} + print('Finding transcript assignments for {0} reads.'.format(len(convs))) + readswithoutassignment = 0 #number of reads which exist in convs but not in pprobs (i.e. weren't assigned to a transcript by salmon) + + txconvs = {} # {txid : {a_a : 200, a_t : 1, etc.}} + + for readid in pprobs: + + try: + readconvs = convs[readid] + except KeyError: #we couldn't find this read in convs + readswithoutassignment +=1 + continue + + for txid in pprobs[readid]: + if txid not in txconvs: + txconvs[txid] = {} + pprob = pprobs[readid][txid] + for conv in readconvs: + scaledconv = readconvs[conv] * pprob + txconvs[txid][conv] = scaledconv + + readswithtxs = len(convs) - readswithoutassignment + pct = round(readswithtxs / len(convs), 2) * 100 + print('Found transcripts for {0} of {1} reads ({2}%).'.format(readswithtxs, len(convs), pct)) + + return txconvs + +def collapsetogene(txconvs, gff): + #Collapse tx-level count measurements to gene level. + #Need to relate transcripts and genes. Do that with the supplied gff annotation. 
+ #txconvs = {txid : {a_a : 200, a_t : 1, etc.}} + + tx2gene = {} #{txid : geneid} + geneid2genename = {} #{geneid : genename} + geneconvs = {} # {geneid : {a_a : 200, a_t : 1, etc.}} + + print('Indexing gff..') + gff_fn = gff + db_fn = os.path.abspath(gff_fn) + '.db' + if os.path.isfile(db_fn) == False: + gffutils.create_db(gff_fn, db_fn, merge_strategy='merge', verbose=True) + print('Done indexing!') + + db = gffutils.FeatureDB(db_fn) + genes = db.features_of_type('gene') + + print('Connecting transcripts and genes...') + for gene in genes: + geneid = str(gene.id).split('.')[0] #remove version numbers + genename = gene.attributes['gene_name'][0] + geneid2genename[geneid] = genename + for tx in db.children(gene, featuretype = 'transcript'): + txid = str(tx.id).split('.')[0] + tx2gene[txid] = geneid + print('Done!') + + allgenes = list(set(tx2gene.values())) + + #Initialize geneconvs dictionary + possibleconvs = [ + 'a_a', 'a_t', 'a_c', 'a_g', 'a_n', + 'g_a', 'g_t', 'g_c', 'g_g', 'g_n', + 'c_a', 'c_t', 'c_c', 'c_g', 'c_n', + 't_a', 't_t', 't_c', 't_g', 't_n'] + + for gene in allgenes: + geneconvs[gene] = {} + for conv in possibleconvs: + geneconvs[gene][conv] = 0 + + for tx in txconvs: + try: + gene = tx2gene[tx] + except KeyError: + print('WARNING: transcript {0} doesn\'t belong to a gene in the supplied annotation.'.format(tx)) + continue + convs = txconvs[tx] + for conv in convs: + convcount = txconvs[tx][conv] + geneconvs[gene][conv] += convcount + + return tx2gene, geneid2genename, geneconvs + +def readspergene(quantsf, tx2gene): + #Get the number of reads assigned to each tx. This can simply be read from the salmon quant.sf file. + #Then, sum read counts across all transcripts within a gene. + #Transcript and gene relationships were derived by collapsetogene(). 
+ + txcounts = {} #{txid : readcounts} + genecounts = {} #{geneid : readcounts} + + with open(quantsf, 'r') as infh: + for line in infh: + line = line.strip().split('\t') + if line[0] == 'Name': + continue + txid = line[0] + counts = float(line[4]) + txcounts[txid] = counts + + allgenes = list(set(tx2gene.values())) + for gene in allgenes: + genecounts[gene] = 0 + + for txid in txcounts: + geneid = tx2gene[txid] + genecounts[geneid] += txcounts[txid] + + return genecounts + + +def writeOutput(geneconvs, genecounts, geneid2genename, outfile, use_g_t, use_g_c): + #Write number of conversions and readcounts for genes. + possibleconvs = [ + 'a_a', 'a_t', 'a_c', 'a_g', 'a_n', + 'g_a', 'g_t', 'g_c', 'g_g', 'g_n', + 'c_a', 'c_t', 'c_c', 'c_g', 'c_n', + 't_a', 't_t', 't_c', 't_g', 't_n'] + + with open(outfile, 'w') as outfh: + #total G is number of ref Gs encountered + #convG is g_t + g_c (the ones we are interested in) + outfh.write(('\t').join(['GeneID', 'GeneName', 'numreads'] + possibleconvs + [ + 'totalG', 'convG', 'convGrate', 'G_Trate', 'G_Crate', 'porc']) + '\n') + genes = sorted(geneconvs.keys()) + + for gene in genes: + genename = geneid2genename[gene] + numreads = genecounts[gene] + convcounts = [] + c = geneconvs[gene] + for conv in possibleconvs: + convcount = c[conv] + convcounts.append(convcount) + + convcounts = ['{:.2f}'.format(x) for x in convcounts] + + totalG = c['g_g'] + c['g_c'] + c['g_t'] + c['g_a'] + c['g_n'] + if use_g_t and use_g_c: + convG = c['g_c'] + c['g_t'] + elif use_g_c and not use_g_t: + convG = c['g_c'] + elif use_g_t and not use_g_c: + convG = c['g_t'] + elif not use_g_t and not use_g_c: + print('ERROR: we have to be counting either G->T or G->C, if not both!') + sys.exit() + + g_ccount = c['g_c'] + g_tcount = c['g_t'] + + totalmut = c['a_t'] + c['a_c'] + c['a_g'] + c['g_t'] + c['g_c'] + c['g_a'] + c['t_a'] + c['t_c'] + c['t_g'] + c['c_t'] + c['c_g'] + c['c_a'] + totalnonmut = c['a_a'] + c['g_g'] + c['c_c'] + c['t_t'] + allnt = totalmut 
+ totalnonmut + + try: + convGrate = convG / totalG + except ZeroDivisionError: + convGrate = 'NA' + + try: + g_crate = g_ccount / totalG + except ZeroDivisionError: + g_crate = 'NA' + + try: + g_trate = g_tcount / totalG + except ZeroDivisionError: + g_trate = 'NA' + + try: + totalmutrate = totalmut / allnt + except ZeroDivisionError: + totalmutrate = 'NA' + + #normalize convGrate to rate of all mutations + #Proportion Of Relevant Conversions + if totalmutrate == 'NA': + porc = 'NA' + elif totalmutrate > 0: + try: + porc = np.log2(convGrate / totalmutrate) + except: + porc = 'NA' + else: + porc = 'NA' + + #Format numbers for printing + if type(convGrate) == float: + convGrate = '{:.2e}'.format(convGrate) + if type(g_trate) == float: + g_trate = '{:.2e}'.format(g_trate) + if type(g_crate) == float: + g_crate = '{:.2e}'.format(g_crate) + if type(porc) == np.float64: + porc = '{:.3f}'.format(porc) + + outfh.write(('\t').join([gene, genename, str(numreads)] + convcounts + [str(totalG), str(convG), str(convGrate), str(g_trate), str(g_crate), str(porc)]) + '\n') + + + + + +if __name__ == '__main__': + print('Getting posterior probabilities from salmon alignment file...') + pprobs = getpostmasterassignments(sys.argv[1]) + print('Done!') + print('Loading conversions from pickle file...') + with open(sys.argv[2], 'rb') as infh: + convs = pickle.load(infh) + print('Done!') + print('Assinging conversions to transcripts...') + txconvs = assigntotxs(pprobs, convs) + print('Done!') + + tx2gene, geneid2genename, geneconvs = collapsetogene(txconvs, sys.argv[3]) + genecounts = readspergene(sys.argv[4], tx2gene) + writeOutput(geneconvs, genecounts, geneid2genename, sys.argv[5], True, True) \ No newline at end of file From 7f584f3ae45a5eb90160b4d2e9d462b748992ac0 Mon Sep 17 00:00:00 2001 From: Matthew Taliaferro Date: Thu, 7 Jul 2022 15:17:24 -0600 Subject: [PATCH 023/108] do not write conversions to pickled dict --- getmismatches.py | 4 ++-- 1 file changed, 2 insertions(+), 2 
deletions(-) diff --git a/getmismatches.py b/getmismatches.py index 0e56cc4..b1290e3 100644 --- a/getmismatches.py +++ b/getmismatches.py @@ -270,8 +270,8 @@ def iteratereads_pairedend(bam, onlyConsiderOverlap, use_g_t, use_g_c, use_read1 pysam.set_verbosity(save) #Pickle and write convs - with open('convs.pkl', 'wb') as outfh: - pickle.dump(convs, outfh) + #with open('convs.pkl', 'wb') as outfh: + #pickle.dump(convs, outfh) return convs, readcounter From e4e0d59e766ef73bc8379d8913e888d6a5b94fd7 Mon Sep 17 00:00:00 2001 From: Matthew Taliaferro Date: Fri, 8 Jul 2022 13:45:17 -0600 Subject: [PATCH 024/108] remove looking for an existing merged.vcf --- snps.py | 12 +++--------- 1 file changed, 3 insertions(+), 9 deletions(-) diff --git a/snps.py b/snps.py index f17c3ff..ee9a102 100644 --- a/snps.py +++ b/snps.py @@ -9,17 +9,12 @@ import sys #This will take in a list of bams and identify variants, creating vcf files for each -def getSNPs(bams, genomefasta, minCoverage = 20, minVarFreq = 0.02): +def getSNPs(bams, genomefasta, minCoverage = 20, minVarFreq = 0.2): if not minCoverage: minCoverage = 20 if not minVarFreq: minVarFreq = 0.2 - #if we already made a vcf, don't make another one - if os.path.exists('merged.vcf'): - print('A merged vcf files already exists! 
Not making another one...') - return None - vcfFileNames = [] for bam in bams: @@ -53,11 +48,10 @@ def getSNPs(bams, genomefasta, minCoverage = 20, minVarFreq = 0.02): with open('vcfconcat.log', 'w') as logfh: vcfFileNames = vcfFileNames * 2 vcfFiles = ' '.join(vcfFileNames) - print(vcfFiles) - concatCMD = 'bcftools merge --force-samples -m snps -O z --output merged.vcf ' + vcfFiles + concatCMD = 'bcftools merge --force-samples -m snps -O z --filter-logic x --output merged.vcf ' + vcfFiles concat = subprocess.Popen(concatCMD, shell = True, stderr = logfh) concat.wait() - filetorecord = vcfFileNames[0] + vcfFileNames = [vcfFileNames[0]] elif len(vcfFileNames) > 1: with open('vcfconcat.log', 'w') as logfh: vcfFiles = ' '.join(vcfFileNames) From 68493bb7c1bdef57807ad09cdb98fe39e53ff189 Mon Sep 17 00:00:00 2001 From: Matthew Taliaferro Date: Fri, 8 Jul 2022 13:45:35 -0600 Subject: [PATCH 025/108] update README --- README.md | 120 ++++++++++++++++++++++++++++++++++++++++++------------ 1 file changed, 94 insertions(+), 26 deletions(-) diff --git a/README.md b/README.md index 5d696b1..7529d2a 100644 --- a/README.md +++ b/README.md @@ -1,13 +1,5 @@ # OINC-seq

Detecting oxidative marks on RNA using high-throughput sequencing -## Overview - -OINC-seq (Oxidation-Induced Nucleotide Conversion sequencing) is a sequencing technology that allows the direction of oxidative marks on RNA molecules. Because guanosine has the lowest redox potential of any of the ribonucleosides, it is the one most likely to be affected by oxidation. When this occurs, guanosine is turned into 8-oxoguanosine (8-OG). A previous [study](https://pubs.acs.org/doi/10.1021/acs.biochem.7b00730) found that when reverse transcriptase encounters guanosine oxidation products, it can misinterpret 8-OG as either T or C. Therefore, to detect these oxidative marks, one can look for G -> T and G -> C conversions in RNAseq data. - -To detect and quantify these conversions, we have created software called **PIGPEN** (Pipeline for Identification of Guanosine Positions Erroneously Notated). - -PIGPEN takes in alignment files (bam), ideally made with [STAR](https://github.com/alexdobin/STAR). Single and paired-end reads are supported, although paired-end reads are preferred (for reasons that will become clear later). To minimize the contribution of positions that appear as mutations due to non-ideal alignments, PIGPEN only considers uniquely aligned reads (mapping quality == 255). For now, it is required that paired-end reads be stranded, and that read 1 correspond to the sense strand. This is true for most, but not all, modern RNAseq library preparation protocols. - ,-,-----, PIGPEN **** \ \ ),)`-' <`--'> \ \` @@ -22,20 +14,34 @@ PIGPEN takes in alignment files (bam), ideally made with [STAR](https://github.c Of Guanosine Positions Erroneously Notated +## Overview + +OINC-seq (Oxidation-Induced Nucleotide Conversion sequencing) is a sequencing technology that allows the direction of oxidative marks on RNA molecules. Because guanosine has the lowest redox potential of any of the ribonucleosides, it is the one most likely to be affected by oxidation. 
When this occurs, guanosine is turned into 8-oxoguanosine (8-OG). A previous [study](https://pubs.acs.org/doi/10.1021/acs.biochem.7b00730) found that when reverse transcriptase encounters guanosine oxidation products, it can misinterpret 8-OG as either T or C. Therefore, to detect these oxidative marks, one can look for G -> T and G -> C conversions in RNAseq data. + +To detect and quantify these conversions, we have created software called **PIGPEN** (Pipeline for Identification of Guanosine Positions Erroneously Notated). + +PIGPEN starts with RNAseq fastq files. These files are aligned to the genome using [STAR](https://github.com/alexdobin/STAR). Single and paired-end reads are supported, although paired-end reads are preferred (for reasons that will become clear later). To minimize the contribution of positions that appear as mutations due to non-ideal alignments, PIGPEN only considers uniquely aligned reads (mapping quality == 255). For now, it is required that paired-end reads be stranded, and that read 1 correspond to the sense strand. This is true for most, but not all, modern RNAseq library preparation protocols. + +Uniquely aligned reads are then extracted and used to quantify transcript abundances using [salmon](https://combine-lab.github.io/salmon/). Posterior probabilities of transcript assignments are then derived using [postmaster](https://github.com/COMBINE-lab/postmaster). `STAR`, `salmon`, and `postmaster` must be in the user's `$PATH`. All three of these preparatory steps can be easily and automatically done using `alignAndQuant.py`. + +Following the creation of alignment files produced by `STAR` and `postmaster` as well as transcript quantifications produced by `salmon`, these files are then used by `pigpen.py` to identify nucleotide conversions, assign them to transcripts and genes, and then quantify the number of conversions in each gene. A graphical overview of the flow of `PIGPEN` is shown below. 
+ +![alt text](https://images.squarespace-cdn.com/content/v1/591d9c8cbebafbf01b1e28f9/f4a15b89-b3f1-4a10-84fc-5e669594f4e4/updatedPIGPENscheme.png?format=1500w "PIGPEN overview") + ## Requirements PIGPEN has the following prerequisites: - python >= 3.6 -- samtools >= 1.13 +- samtools >= 1.15 - varscan >= 2.4.4 -- bcftools >= 1.13 -- pybedtools >= 0.8.2 -- pysam >= 0.16 +- bcftools >= 1.15 +- pysam >= 0.19 - numpy >= 1.21 -- pandas >= 1.3.3 -- bamtools >= 2.5.1 -- bedtools >= 2.30.0 +- pandas >= 1.3.5 +- bamtools >= 2.5.2 +- salmon >= 1.9.0 +- gffutils >= 0.11.0 BACON has the following prerequisites: @@ -49,19 +55,69 @@ BACON has the following prerequisites: For now, installation can be done by cloning this repository. As PIPGEN matures, we will work towards getting this package on [bioconda](https://bioconda.github.io/). +## Preparing alignment files + +`pigpen.py` expects a particular directory structure for organization of `STAR`, `salmon`, and `postmaster` outputs. This is represented below. + +``` +workingdir +│ +└───sample1 +│ │ +│ └───STAR +│ │ │ sample1Aligned.sortedByCoord.out.bam +│ │ │ sample1Aligned.sortedByCoord.out.bam.bai +│ │ │ ... +│ │ +│ └───salmon +│ │ │ sample1.quant.sf +│ │ │ sample1.salmon.bam +│ │ │ ... +│ │ +│ └───postmaster +│ │ │ sample1.postmaster.bam +│ │ │ sample1.postmaster.bam.bai +│ │ │ ... +│ +└───sample2 +│ │ +│ └───STAR +│ │ │ sample2Aligned.sortedByCoord.out.bam +│ │ │ sample2Aligned.sortedByCoord.out.bam.bai +│ │ │ ... +│ │ +│ └───salmon +│ │ │ sample2.quant.sf +│ │ │ sample2.salmon.bam +│ │ │ ... +│ │ +│ └───postmaster +│ │ │ sample2.postmaster.bam +│ │ │ sample2.postmaster.bam.bai +│ │ │ ... +... +``` + +This structure can be automatically acheived by running `alignAndQuant.py` in `workingdir` once for each sample. Following this, the samples are ready to be analyzed with `pigpen.py`. 
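+Before running `pigpen.py`, it can be useful to confirm that each sample directory matches the layout above. The sketch below is purely illustrative (the `expected_outputs` and `missing_outputs` helpers are not part of PIGPEN); it only checks for the file-name patterns shown in the tree. The `alignAndQuant.py` command that actually generates this layout is shown below.

```python
import os

def expected_outputs(workingdir, sample):
    #Relative paths that alignAndQuant.py is expected to leave behind for one sample,
    #following the directory tree above
    return [
        os.path.join(workingdir, sample, 'STAR', sample + 'Aligned.sortedByCoord.out.bam'),
        os.path.join(workingdir, sample, 'salmon', sample + '.quant.sf'),
        os.path.join(workingdir, sample, 'salmon', sample + '.salmon.bam'),
        os.path.join(workingdir, sample, 'postmaster', sample + '.postmaster.bam'),
    ]

def missing_outputs(workingdir, samples):
    #Return {samplename : [missing files]} so problem samples are easy to spot
    missing = {}
    for sample in samples:
        absent = [f for f in expected_outputs(workingdir, sample) if not os.path.exists(f)]
        if absent:
            missing[sample] = absent
    return missing
```

If `missing_outputs('.', ['sample1', 'sample2'])` comes back empty, the samples are ready for `pigpen.py`.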
+ +For example: + +`python alignAndQuant.py --forwardreads reads.r1.fq.gz --reversereads reads.r2.fq.gz --nthreads 32 --STARindex --salmonindex --samplename sample1` + +`STARindex` and `salmonindex` should be created according to the instructions for creating them found [here](https://github.com/alexdobin/STAR/blob/master/doc/STARmanual.pdf) and [here](https://salmon.readthedocs.io/en/latest/). + +## Running PIGPEN + +Samples are then ready for analysis with `pigpen.py`. From `workingdir`, a comma-separated list of samples is supplied to `--samplenames`. In the example above, this would be `--samplenames sample1,sample2`. Optionally, a list of control samples are provided to `--controlsamples`. These should correspond to samples in which nucleotide conversions were not intentionally induced. They serve as controls for SNP identification (see below). They may be a subset of the samples provided to `--samplenames`. + ## SNPs 8-OG-induced conversions are rare, and this rarity makes it imperative that contributions from conversions that are not due to oxidation are minimized. A major source of apparent conversions is SNPs. It is therefore advantageous to find and mask SNPs in the data. -PIGPEN performs this by using [varscan](http://varscan.sourceforge.net/using-varscan.html) to find SNP positions. These locations are then excluded from all future analyses. Varscan parameters are controled by the PIGPEN parameters `--SNPcoverage` and `--SNPfreq` that control the depth and frequency required to call a SNP. We recommend being aggressive with these parameters. We often set them to 20 and 0.02, respectively. - -PIGPEN performs this SNP calling on control alignment files (`--controlBams`) in which the intended oxidation did not occur. PIGPEN will use the union of all SNPs found in these files for masking. Whether or not to call SNPs at all (you probably should) is controlled by `--useSNPs`. - -This process can be time consuming. 
At the end, a file called **merged.vcf** is created in the current working directory. If this file is present, PIGPEN will assume that it should be used for SNP masking, allowing the process of identifying SNPs to be skipped. +PIGPEN performs this by using [varscan](http://varscan.sourceforge.net/using-varscan.html) to find SNP positions. These locations are then excluded from all future analyses. Varscan parameters are controled by the PIGPEN parameters `--SNPcoverage` and `--SNPfreq` that control the depth and frequency required to call a SNP. We recommend being aggressive with these parameters. We often set them to 20 and 0.2, respectively. -## Filtering alignments +PIGPEN performs this SNP calling on control samples (`--controlsamples`) in which the intended oxidation did not occur. PIGPEN will use the union of all SNPs found in these files for masking. Whether or not to call SNPs at all (you probably should) is controlled by `--useSNPs`. -Because the process of finding nucleotide conversions can take a long time, PIGPEN first filters the reads, keeping only those that overlap with any feature in a supplied bed file (`--geneBed`). This process can use multiple processors (`--nproc`) to speed it up, and requires another file (`--chromsizes`). This file is a 2 column, tab-delimited text file where column 1 is the reference (chromosome) names for the references present in the alignment file, and column 2 is the integer size of that reference. If a fasta file and fasta index for the genome exists, this file can be made using `cut -f 1,2 genome.fa.fai`. ## Quantifying conversions @@ -69,17 +125,29 @@ PIGPEN then identifies conversions in reads. This can be done using multiple pro First, `--onlyConsiderOverlap` requires that the same conversion be observed in both reads of a mate pair. Positions interrogated by only one read are not considered. This can improve accuracy. True oxidation-induced conversions are rare. 
Rare enough that sequencing errors can cause a problem. Requiring that a conversion be present in both reads minimizes the effect of sequencing errors. If the fragment sizes for a library are especially large relative to the read length, the number of positions interrogated by both mates will be small. -Second, `--requireMultipleConv` requires that there be at least two G -> C / G -> T conversions in a read pair in order for those conversions to be recorded. The rationale here is again to reduce the contribution of background, non-oxidation-related conversions. Background conversions should be distributed relatively randomly across reads. However, due to the spatial nature of the oxidation reaction, oxidation-induced conversions should be more clustered into specific reads. Therefore, requiring at least two conversions can increase specificity. In practice, this works well if the data is very deep or concentrated on a small number of targets. When dealing with transcriptome-scale data, this flag often reduces the number of observed conversions to an unacceptably low level. +Second, `nConv` sets the minimum number of G -> C / G -> T conversions in a read pair in order for those conversions to be recorded. The rationale here is again to reduce the contribution of background, non-oxidation-related conversions. Background conversions should be distributed relatively randomly across reads. However, due to the spatial nature of the oxidation reaction, oxidation-induced conversions should be more clustered into specific reads. Therefore, requiring at least two conversions can increase specificity. In practice, this works well if the data is very deep or concentrated on a small number of targets. When dealing with transcriptome-scale data, this flag often reduces the number of observed conversions to an unacceptably low level. 
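
The effect of a minimum-conversion threshold like `nConv` can be illustrated with a short sketch. This is a standalone illustration of the filtering logic described above, not PIGPEN's actual implementation; the function names are hypothetical, and the per-read conversion dictionaries use the same `g_t`/`g_c` keys PIGPEN uses internally.

```python
def count_relevant_convs(convs, use_g_t=True, use_g_c=True):
    #convs is a per-read-pair conversion dictionary like {'g_t': 1, 'g_c': 1, 'g_g': 40, ...}
    total = 0
    if use_g_t:
        total += convs.get('g_t', 0)
    if use_g_c:
        total += convs.get('g_c', 0)
    return total

def passes_nconv(convs, nConv, use_g_t=True, use_g_c=True):
    #A read pair's conversions are only recorded if it carries at least nConv
    #G -> T / G -> C conversions; otherwise it is more likely a sequencing error or SNP.
    return count_relevant_convs(convs, use_g_t, use_g_c) >= nConv
```

With `nConv` set to 2, a read pair carrying one G -> T and one G -> C passes, while a pair carrying a single isolated G -> T does not.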
## Assigning reads to genes -For now, PIGPEN using `bedtools` and a supplied bed file of gene locations (`--geneBed`) to assign individual reads to genes. We are working on improvements in this area. +After PIGPEN calculates the number of converted and noncoverted nucleotides in each read pair, it intersects that data with the probabilistic transcript assignment for each read performed by `salmon` and `postmaster`. Conversions within read pair X are assigned proportionally to transcript Y according to the `salmon`/`postmaster`-calculated probability that read pair X originated from transcript Y. This transcript-level data is then collapsed to gene-level data according to the transcript/gene relationships found in `--gff`. Transcript IDs in `--gff` should match those in the fasta file used to make `--salmonindex`. The use of [GENCODE](www.gencodegenes.org) annotations is recommended if possible. ## Calculating the number of conversions per gene -After identifying the conversions present in each read and the cognate gene for each read, the number of conversions for each gene is calculated. We have observed that the overall rate of conversions (not just G -> T + G -> C, but all conversions) can vary signficantly from sample to sample, presumably due to a technical effect in library preparation. For this reason, PIGPEN calculates **PORC** (Proportion of Relevant Conversions) values. This is the log2 ratio of the relevant conversion rate ([G -> T + G -> C] / total number of reference G encountered) to the overall conversion rate (total number of all conversions / total number of positions interrogated). PORC therefore normalizes to the overall rate of conversions, removing this technical effect. +We have observed that the overall rate of conversions (not just G -> T + G -> C, but all conversions) can vary signficantly from sample to sample, presumably due to a technical effect in library preparation. 
For this reason, PIGPEN calculates **PORC** (Proportion of Relevant Conversions) values. This is the log2 ratio of the relevant conversion rate ([G -> T + G -> C] / total number of reference G encountered) to the overall conversion rate (total number of all conversions / total number of positions interrogated). PORC therefore normalizes to the overall rate of conversions, removing this technical effect. + +PIGPEN can use G -> T conversions, G -> C conversions, or both when calculating PORC values. This behavior is controlled by supplying the options `--use_g_t` and `--use_g_c`. To consider both types of conversions, supply both flags. + +## Using one read of a paired-end sample + +Whether read 1, read 2, or both reads of a paired-end sample are used for conversion quantification is controlled with `--use_read1` and `--use_read2`. To use both reads, supply both flags. `--onlyConsiderOverlap` requires the use of both reads. Importantly, both reads are still used for genomic alignment and transcript quantification. + +## Mask specific positions + +To prevent specific genomic locations from being considered during conversion quantification, supply a bed file of these locations to `--maskbed`. + +## Output -PIGPEN can use G -> T conversions, G -> C conversions, or both when calculating PORC values. This behavior is controlled by supplying the options `--use_g_t` and `--use_g_c`. +Output files are named `<samplename>.pigpen.txt`. These files contain the number of observed conversions for each gene as well as derived values like conversion rates and PORC values. 
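As a sketch, the PORC calculation described above reduces to a few lines. The function and argument names below are illustrative, not PIGPEN's actual API:

```python
import math

def porc(g_t, g_c, total_g, total_convs, total_positions):
    """log2 of (relevant conversion rate) / (overall conversion rate).

    relevant rate: (G->T + G->C) / number of reference Gs encountered
    overall rate:  all conversions / all positions interrogated
    """
    relevant_rate = (g_t + g_c) / total_g
    overall_rate = total_convs / total_positions
    return math.log2(relevant_rate / overall_rate)

# Relevant conversions occurring at twice the overall background rate give PORC = 1.
print(porc(g_t=6, g_c=4, total_g=1000, total_convs=20, total_positions=4000))  # 1.0
```

Note that PORC is undefined when either rate is zero; in that case the pipeline writes `NA` in the output rather than a number.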
## Statistical framework for comparing gene-level PORC values across conditions From cf6551652c6212ac9a72bee00566d82b9504a3e8 Mon Sep 17 00:00:00 2001 From: Matthew Taliaferro Date: Fri, 8 Jul 2022 13:46:25 -0600 Subject: [PATCH 026/108] major reorg including incorporation of salmon --- pigpen.py | 118 ++++++++++++++++++++++++++++-------------------------- 1 file changed, 62 insertions(+), 56 deletions(-) diff --git a/pigpen.py b/pigpen.py index dc8ed83..f9d14be 100644 --- a/pigpen.py +++ b/pigpen.py @@ -7,25 +7,20 @@ import sys from snps import getSNPs, recordSNPs from maskpositions import readmaskbed -from filterbam import intersectreads, filterbam, intersectreads_multiprocess from getmismatches import iteratereads_pairedend, getmismatches -from assignreads import getReadOverlaps, processOverlaps -from conversionsPerGene import getPerGene, writeConvsPerGene -import pickle +from assignreads_salmon import getpostmasterassignments, assigntotxs, collapsetogene, readspergene, writeOutput if __name__ == '__main__': parser = argparse.ArgumentParser(description=' ,-,-----,\n PIGPEN **** \\ \\ ),)`-\'\n <`--\'> \\ \\` \n /. . `-----,\n OINC! > (\'\') , @~\n `-._, ___ /\n-|-|-|-|-|-|-|-| (( / (( / -|-|-| \n|-|-|-|-|-|-|-|- \'\'\' \'\'\' -|-|-|-\n-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|\n\n Pipeline for Identification \n Of Guanosine Positions\n Erroneously Notated', formatter_class = argparse.RawDescriptionHelpFormatter) - parser.add_argument('--bam', type = str, help = 'Aligned reads (ideally STAR uniquely aligned reads) to quantify', required = True) - parser.add_argument('--controlBams', type = str, help = 'Comma separated list of alignments from control samples (i.e. those where no *induced* conversions are expected. 
Required if SNPs are to be considered.') + parser.add_argument('--samplenames', type = str, help = 'Comma separated list of samples to quantify.', required = True) + parser.add_argument('--controlsamples', type = str, help = 'Comma separated list of control samples (i.e. those where no *induced* conversions are expected). May be a subset of samplenames. Required if SNPs are to be considered.') + parser.add_argument('--gff', type = str, help = 'Genome annotation in gff format.') parser.add_argument('--genomeFasta', type = str, help = 'Genome sequence in fasta format. Required if SNPs are to be considered.') - parser.add_argument('--geneBed', type = str, help = 'Bed file of genomic regions to quantify. Fourth field must be gene ID.') - parser.add_argument('--chromsizes', type = str, help = 'Tab-delimited file of chromosomes in the order the appear in the bed/bam and their sizes. Can be made with cut -f 1,2 genome.fa.fai') - parser.add_argument('--output', type = str, help = 'Output file of conversion rates for each gene.') parser.add_argument('--nproc', type = int, help = 'Number of processors to use. Default is 1.', default = 1) parser.add_argument('--useSNPs', action = 'store_true', help = 'Consider SNPs?') parser.add_argument('--maskbed', help = 'Optional. Bed file of positions to mask from analysis.', default = None) parser.add_argument('--SNPcoverage', type = int, help = 'Minimum coverage to call SNPs. Default = 20', default = 20) - parser.add_argument('--SNPfreq', type = float, help = 'Minimum variant frequency to call SNPs. Default = 0.02', default = 0.02) + parser.add_argument('--SNPfreq', type = float, help = 'Minimum variant frequency to call SNPs. 
Default = 0.2', default = 0.2) parser.add_argument('--onlyConsiderOverlap', action = 'store_true', help = 'Only consider conversions seen in both reads of a read pair?') parser.add_argument('--use_g_t', action = 'store_true', help = 'Consider G->T conversions?') parser.add_argument('--use_g_c', action = 'store_true', help = 'Consider G->C conversions?') @@ -34,6 +29,24 @@ parser.add_argument('--nConv', type = int, help = 'Minimum number of required G->T and/or G->C conversions in a read pair in order for conversions to be counted. Default is 1.', default = 1) args = parser.parse_args() + #Take in list of samplenames to run pigpen on + #Derive quant.sf, STAR bams, and postmaster bams + samplenames = args.samplenames.split(',') + salmonquants = [os.path.join(x, 'salmon', '{0}.quant.sf'.format(x)) for x in samplenames] + starbams = [os.path.join(x, 'STAR', '{0}Aligned.sortedByCoord.out.bam'.format(x)) for x in samplenames] + postmasterbams = [os.path.join(x, 'postmaster', '{0}.postmaster.bam'.format(x)) for x in samplenames] + + #Take in list of control samples, make list of their corresponding star bams for SNP calling + controlsamples = args.controlsamples.split(',') + controlindicies = [] + for ind, x in enumerate(samplenames): + if x in controlsamples: + controlindicies.append(ind) + + controlstarbams = [] + for x in controlindicies: + controlstarbams.append(starbams[x]) + #We have to be either looking for G->T or G->C, if not both if not args.use_g_t and not args.use_g_c: print('We have to either be looking for G->T or G->C, if not both! 
Add argument --use_g_t and/or --use_g_c.') @@ -48,28 +61,23 @@ if args.onlyConsiderOverlap and (not args.use_read1 or not args.use_read2): print('If we are only going to consider overlap between paired reads, we must use both read1 and read2.') sys.exit() - - #Make index for bam if there isn't one already - bamindex = args.bam + '.bai' - if not os.path.exists(bamindex): - indexCMD = 'samtools index ' + args.bam - index = subprocess.Popen(indexCMD, shell = True) - index.wait() #Make vcf file for snps if args.useSNPs: - controlbams = args.controlBams.split(',') - - #Make index for each control bam if there isn't one already - for bam in controlbams: - bamindex = bam + '.bai' - if not os.path.exists(bamindex): - indexCMD = 'samtools index ' + bam - index = subprocess.Popen(indexCMD, shell = True) - index.wait() - - vcfFileNames = getSNPs(controlbams, args.genomeFasta, args.SNPcoverage, args.SNPfreq) - snps = recordSNPs('merged.vcf') + if not os.path.exists('snps'): + os.mkdir('snps') + vcfFileNames = getSNPs(controlstarbams, args.genomeFasta, args.SNPcoverage, args.SNPfreq) + for f in vcfFileNames: + csi = f + '.csi' + log = f[:-3] + '.log' + #Move files to snps directory + os.rename(f, os.path.join('snps', f)) + os.rename(csi, os.path.join('snps', csi)) + os.rename(log, os.path.join('snps', log)) + + os.rename('merged.vcf', os.path.join('snps', 'merged.vcf')) + os.rename('vcfconcat.log', os.path.join('snps', 'vcfconcat.log')) + snps = recordSNPs(os.path.join('snps', 'merged.vcf')) elif not args.useSNPs: snps = None @@ -81,33 +89,31 @@ elif not args.maskbed: maskpositions = None - #Filter bam for reads contained within entries in geneBed - #This will reduce the amount of time it takes to find conversions - print('Filtering bam for reads contained within regions of interest...') - if args.nproc == 1: - intersectreads(args.bam, args.geneBed, args.chromsizes) - filteredbam = filterbam(args.bam, args.nproc) - elif args.nproc > 1: - intersectreads_multiprocess(args.bam, 
args.geneBed, args.chromsizes, args.nproc) - filteredbam = filterbam(args.bam, args.nproc) - - #Identify conversions - if args.nproc == 1: - convs, readcounter = iteratereads_pairedend(filteredbam, args.onlyConsiderOverlap, args.use_g_t, args.use_g_c, args.use_read1, args.use_read2, args.nConv, snps, maskpositions, 'high') - elif args.nproc > 1: - convs = getmismatches(filteredbam, args.onlyConsiderOverlap, snps, maskpositions, args.nConv, args.nproc, args.use_g_t, args.use_g_c, args.use_read1, args.use_read2) - - - #Assign reads to genes - print('Assigning reads to genes...') - overlaps, numpairs = getReadOverlaps(filteredbam, args.geneBed, args.chromsizes) - read2gene = processOverlaps(overlaps, numpairs) - os.remove(filteredbam) - os.remove(filteredbam + '.bai') - - #Calculate number of conversions per gene - numreadspergene, convsPerGene = getPerGene(convs, read2gene) - writeConvsPerGene(numreadspergene, convsPerGene, args.output, args.use_g_t, args.use_g_c) + #For each sample, identify conversions, assign conversions to transcripts, + #and collapse transcript-level measurements to gene-level measurements. 
+ for ind, sample in enumerate(samplenames): + print('Running PIGPEN for {0}...'.format(sample)) + starbam = starbams[ind] + if args.nproc == 1: + convs, readcounter = iteratereads_pairedend(starbam, args.onlyConsiderOverlap, args.use_g_t, args.use_g_c, args.use_read1, args.use_read2, args.nConv, snps, maskpositions, 'high') + elif args.nproc > 1: + convs = getmismatches(starbam, args.onlyConsiderOverlap, snps, maskpositions, args.nConv, args.nproc, args.use_g_t, args.use_g_c, args.use_read1, args.use_read2) + + print('Getting posterior probabilities from salmon alignment file...') + postmasterbam = postmasterbams[ind] + pprobs = getpostmasterassignments(postmasterbam) + print('Assinging conversions to transcripts...') + txconvs = assigntotxs(pprobs, convs) + print('Collapsing transcript level conversion counts to gene level...') + tx2gene, geneconvs = collapsetogene(txconvs, args.gff) + print('Counting number of reads assigned to each gene...') + salmonquant = salmonquants[ind] + genecounts = readspergene(salmonquant, tx2gene) + print('Writing output...') + outputfile = sample + '.pigpen.txt' + writeOutput(geneconvs, genecounts, outputfile, args.use_g_t, args.use_g_c) + print('Done!') + From 40c76723b5348874e6c47ea560c8f9b07fc16627 Mon Sep 17 00:00:00 2001 From: Matthew Taliaferro Date: Fri, 15 Jul 2022 10:12:24 -0600 Subject: [PATCH 027/108] fix bug where convs per tx were being overwritten --- assignreads_salmon.py | 5 ++++- 1 file changed, 4 insertions(+), 1 deletion(-) diff --git a/assignreads_salmon.py b/assignreads_salmon.py index 755243b..903ba16 100644 --- a/assignreads_salmon.py +++ b/assignreads_salmon.py @@ -57,7 +57,10 @@ def assigntotxs(pprobs, convs): pprob = pprobs[readid][txid] for conv in readconvs: scaledconv = readconvs[conv] * pprob - txconvs[txid][conv] = scaledconv + if conv not in txconvs[txid]: + txconvs[txid][conv] = scaledconv + else: + txconvs[txid][conv] += scaledconv readswithtxs = len(convs) - readswithoutassignment pct = 
round(readswithtxs / len(convs), 2) * 100 From abdd808ee31a4a887b71ea45874560cf6956ff1c Mon Sep 17 00:00:00 2001 From: Matthew Taliaferro Date: Fri, 15 Jul 2022 10:13:05 -0600 Subject: [PATCH 028/108] remove salmonbam after creating it --- alignAndQuant.py | 3 +++ 1 file changed, 3 insertions(+) diff --git a/alignAndQuant.py b/alignAndQuant.py index b25eca7..25df4c2 100644 --- a/alignAndQuant.py +++ b/alignAndQuant.py @@ -133,6 +133,9 @@ def runPostmaster(samplename, nthreads): command = ['samtools', 'index', outputfile] subprocess.call(command) + + #We don't need the salmon alignment file anymore, and it's pretty big + os.remove(salmonbam) print('Finished postmaster for {0}!'.format(samplename)) From cd4ec282b5120079965be7468e2602a3228acd24 Mon Sep 17 00:00:00 2001 From: Matthew Taliaferro Date: Fri, 15 Jul 2022 10:13:41 -0600 Subject: [PATCH 029/108] update python version in README --- README.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/README.md b/README.md index 7529d2a..144c0a1 100644 --- a/README.md +++ b/README.md @@ -32,7 +32,7 @@ Following the creation of alignment files produced by `STAR` and `postmaster` as PIGPEN has the following prerequisites: -- python >= 3.6 +- python >= 3.8 - samtools >= 1.15 - varscan >= 2.4.4 - bcftools >= 1.15 From cf149953e15706a23f3ca9ba3a2407ae98b8bae6 Mon Sep 17 00:00:00 2001 From: Matthew Taliaferro Date: Fri, 15 Jul 2022 11:27:49 -0600 Subject: [PATCH 030/108] add parameters to output and create outputDir --- assignreads_salmon.py | 9 +++++++- pigpen.py | 51 +++++++++++++++++++++++++++++++------------ 2 files changed, 45 insertions(+), 15 deletions(-) diff --git a/assignreads_salmon.py b/assignreads_salmon.py index 903ba16..b700742 100644 --- a/assignreads_salmon.py +++ b/assignreads_salmon.py @@ -152,7 +152,7 @@ def readspergene(quantsf, tx2gene): return genecounts -def writeOutput(geneconvs, genecounts, geneid2genename, outfile, use_g_t, use_g_c): +def writeOutput(sampleparams, geneconvs, 
genecounts, geneid2genename, outfile, use_g_t, use_g_c): #Write number of conversions and readcounts for genes. possibleconvs = [ 'a_a', 'a_t', 'a_c', 'a_g', 'a_n', @@ -161,6 +161,9 @@ def writeOutput(geneconvs, genecounts, geneid2genename, outfile, use_g_t, use_g_ 't_a', 't_t', 't_c', 't_g', 't_n'] with open(outfile, 'w') as outfh: + #Write arguments for this pigpen run + for arg in sampleparams: + outfh.write('#' + arg + '\t' + str(sampleparams[arg]) + '\n') #total G is number of ref Gs encountered #convG is g_t + g_c (the ones we are interested in) outfh.write(('\t').join(['GeneID', 'GeneName', 'numreads'] + possibleconvs + [ @@ -229,6 +232,10 @@ def writeOutput(geneconvs, genecounts, geneid2genename, outfile, use_g_t, use_g_ porc = 'NA' #Format numbers for printing + if type(numreads) == float: + numreads = '{:.2f}'.format(numreads) + if type(totalG) == float: + totalG = '{:.2f}'.format(totalG) if type(convGrate) == float: convGrate = '{:.2e}'.format(convGrate) if type(g_trate) == float: diff --git a/pigpen.py b/pigpen.py index f9d14be..2c192f9 100644 --- a/pigpen.py +++ b/pigpen.py @@ -13,11 +13,12 @@ if __name__ == '__main__': parser = argparse.ArgumentParser(description=' ,-,-----,\n PIGPEN **** \\ \\ ),)`-\'\n <`--\'> \\ \\` \n /. . `-----,\n OINC! > (\'\') , @~\n `-._, ___ /\n-|-|-|-|-|-|-|-| (( / (( / -|-|-| \n|-|-|-|-|-|-|-|- \'\'\' \'\'\' -|-|-|-\n-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|\n\n Pipeline for Identification \n Of Guanosine Positions\n Erroneously Notated', formatter_class = argparse.RawDescriptionHelpFormatter) parser.add_argument('--samplenames', type = str, help = 'Comma separated list of samples to quantify.', required = True) - parser.add_argument('--controlsamples', type = str, help = 'Comma separated list of control samples (i.e. those where no *induced* conversions are expected). May be a subset of samplenames. 
Required if SNPs are to be considered.') + parser.add_argument('--controlsamples', type = str, help = 'Comma separated list of control samples (i.e. those where no *induced* conversions are expected). May be a subset of samplenames. Required if SNPs are to be considered and a snpfile is not supplied.') parser.add_argument('--gff', type = str, help = 'Genome annotation in gff format.') parser.add_argument('--genomeFasta', type = str, help = 'Genome sequence in fasta format. Required if SNPs are to be considered.') parser.add_argument('--nproc', type = int, help = 'Number of processors to use. Default is 1.', default = 1) parser.add_argument('--useSNPs', action = 'store_true', help = 'Consider SNPs?') + parser.add_argument('--snpfile', type = str, help = 'VCF file of snps to mask. If --useSNPs but a --snpfile is not supplied, a VCF of snps will be created using --controlsamples.') parser.add_argument('--maskbed', help = 'Optional. Bed file of positions to mask from analysis.', default = None) parser.add_argument('--SNPcoverage', type = int, help = 'Minimum coverage to call SNPs. Default = 20', default = 20) parser.add_argument('--SNPfreq', type = float, help = 'Minimum variant frequency to call SNPs. Default = 0.2', default = 0.2) @@ -27,8 +28,15 @@ parser.add_argument('--use_read1', action = 'store_true', help = 'Use read1 when looking for conversions?') parser.add_argument('--use_read2', action = 'store_true', help = 'Use read2 when looking for conversions?') parser.add_argument('--nConv', type = int, help = 'Minimum number of required G->T and/or G->C conversions in a read pair in order for conversions to be counted. 
Default is 1.', default = 1) + parser.add_argument('--outputDir', type = str, help = 'Output directory.', required = True) args = parser.parse_args() + #Store command line arguments + suppliedargs = {} + for arg in vars(args): + if arg != 'samplenames': + suppliedargs[arg] = getattr(args, arg) + #Take in list of samplenames to run pigpen on #Derive quant.sf, STAR bams, and postmaster bams samplenames = args.samplenames.split(',') @@ -37,15 +45,16 @@ postmasterbams = [os.path.join(x, 'postmaster', '{0}.postmaster.bam'.format(x)) for x in samplenames] #Take in list of control samples, make list of their corresponding star bams for SNP calling - controlsamples = args.controlsamples.split(',') - controlindicies = [] - for ind, x in enumerate(samplenames): - if x in controlsamples: - controlindicies.append(ind) + if args.controlsamples: + controlsamples = args.controlsamples.split(',') + controlindicies = [] + for ind, x in enumerate(samplenames): + if x in controlsamples: + controlindicies.append(ind) - controlstarbams = [] - for x in controlindicies: - controlstarbams.append(starbams[x]) + controlstarbams = [] + for x in controlindicies: + controlstarbams.append(starbams[x]) #We have to be either looking for G->T or G->C, if not both if not args.use_g_t and not args.use_g_c: @@ -63,7 +72,12 @@ sys.exit() #Make vcf file for snps - if args.useSNPs: + if args.snpfile: + snps = recordSNPs(args.snpfile) + if args.useSNPs and not args.snpfile and not args.controlsamples: + print('ERROR: If we want to consider snps we either have to give control samples for finding snps or a vcf file of snps we already know!') + sys.exit() + if args.useSNPs and not args.snpfile: if not os.path.exists('snps'): os.mkdir('snps') vcfFileNames = getSNPs(controlstarbams, args.genomeFasta, args.SNPcoverage, args.SNPfreq) @@ -79,7 +93,7 @@ os.rename('vcfconcat.log', os.path.join('snps', 'vcfconcat.log')) snps = recordSNPs(os.path.join('snps', 'merged.vcf')) - elif not args.useSNPs: + elif not 
args.useSNPs and not args.snpfile: snps = None #Get positions to manually mask if given @@ -92,8 +106,13 @@ #For each sample, identify conversions, assign conversions to transcripts, #and collapse transcript-level measurements to gene-level measurements. for ind, sample in enumerate(samplenames): + #Create paramter dictionary that is unique to this sample + sampleparams = suppliedargs + sampleparams['sample'] = sample + print('Running PIGPEN for {0}...'.format(sample)) starbam = starbams[ind] + sampleparams['starbam'] = os.path.abspath(starbam) if args.nproc == 1: convs, readcounter = iteratereads_pairedend(starbam, args.onlyConsiderOverlap, args.use_g_t, args.use_g_c, args.use_read1, args.use_read2, args.nConv, snps, maskpositions, 'high') elif args.nproc > 1: @@ -101,17 +120,21 @@ print('Getting posterior probabilities from salmon alignment file...') postmasterbam = postmasterbams[ind] + sampleparams['postmasterbam'] = os.path.abspath(postmasterbam) pprobs = getpostmasterassignments(postmasterbam) print('Assinging conversions to transcripts...') txconvs = assigntotxs(pprobs, convs) print('Collapsing transcript level conversion counts to gene level...') - tx2gene, geneconvs = collapsetogene(txconvs, args.gff) + tx2gene, geneid2genename, geneconvs = collapsetogene(txconvs, args.gff) print('Counting number of reads assigned to each gene...') salmonquant = salmonquants[ind] + sampleparams['salmonquant'] = os.path.abspath(salmonquant) genecounts = readspergene(salmonquant, tx2gene) print('Writing output...') - outputfile = sample + '.pigpen.txt' - writeOutput(geneconvs, genecounts, outputfile, args.use_g_t, args.use_g_c) + if not os.path.exists(args.outputDir): + os.mkdir(args.outputDir) + outputfile = os.path.join(args.outputDir, sample + '.pigpen.txt') + writeOutput(sampleparams, geneconvs, genecounts, geneid2genename, outputfile, args.use_g_t, args.use_g_c) print('Done!') From c6aa3c093fa8d610d570f427b93ca890effbfc82 Mon Sep 17 00:00:00 2001 From: Matthew 
Taliaferro Date: Fri, 15 Jul 2022 17:21:02 -0600 Subject: [PATCH 031/108] format convG in output --- assignreads_salmon.py | 2 ++ 1 file changed, 2 insertions(+) diff --git a/assignreads_salmon.py b/assignreads_salmon.py index b700742..c05aed0 100644 --- a/assignreads_salmon.py +++ b/assignreads_salmon.py @@ -234,6 +234,8 @@ def writeOutput(sampleparams, geneconvs, genecounts, geneid2genename, outfile, u #Format numbers for printing if type(numreads) == float: numreads = '{:.2f}'.format(numreads) + if type(convG) == float: + convG = '{:.2f}'.format(convG) if type(totalG) == float: totalG = '{:.2f}'.format(totalG) if type(convGrate) == float: From 6b122da6c8db85252003eba47ddcc9c33666b9fd Mon Sep 17 00:00:00 2001 From: Matthew Taliaferro Date: Wed, 27 Jul 2022 16:00:56 -0600 Subject: [PATCH 032/108] deal inelegantly with tx versions in salmon output --- assignreads_salmon.py | 9 +++++++-- 1 file changed, 7 insertions(+), 2 deletions(-) diff --git a/assignreads_salmon.py b/assignreads_salmon.py index c05aed0..04c7800 100644 --- a/assignreads_salmon.py +++ b/assignreads_salmon.py @@ -137,7 +137,7 @@ def readspergene(quantsf, tx2gene): line = line.strip().split('\t') if line[0] == 'Name': continue - txid = line[0] + txid = line[0].split('.')[0] #remove tx id version in the salmon quant.sf if it exists counts = float(line[4]) txcounts[txid] = counts @@ -146,7 +146,12 @@ def readspergene(quantsf, tx2gene): genecounts[gene] = 0 for txid in txcounts: - geneid = tx2gene[txid] + try: + geneid = tx2gene[txid] + except KeyError: #maybe the salmon tx id have version numbers + txid = txid.split('.')[0] + geneid = tx2gene[txid] + genecounts[geneid] += txcounts[txid] return genecounts From 2cd23ab570bdff10e09e452e8c6e70159a4c7cc9 Mon Sep 17 00:00:00 2001 From: Matthew Taliaferro Date: Fri, 29 Jul 2022 10:25:03 -0600 Subject: [PATCH 033/108] add look for convs in defined regions --ROIbed --- conversionsPerGene.py | 5 +- pigpen.py | 113 +++++++++++++++++++++++++++--------------- 2 
files changed, 77 insertions(+), 41 deletions(-) diff --git a/conversionsPerGene.py b/conversionsPerGene.py index 493c1cd..21bce0c 100644 --- a/conversionsPerGene.py +++ b/conversionsPerGene.py @@ -51,7 +51,7 @@ def getPerGene(convs, reads2gene): return numreadspergene, convsPerGene -def writeConvsPerGene(numreadspergene, convsPerGene, outfile, use_g_t, use_g_c): +def writeConvsPerGene(sampleparams, numreadspergene, convsPerGene, outfile, use_g_t, use_g_c): possibleconvs = [ 'a_a', 'a_t', 'a_c', 'a_g', 'a_n', 'g_a', 'g_t', 'g_c', 'g_g', 'g_n', @@ -59,6 +59,9 @@ def writeConvsPerGene(numreadspergene, convsPerGene, outfile, use_g_t, use_g_c): 't_a', 't_t', 't_c', 't_g', 't_n'] with open(outfile, 'w') as outfh: + #Write arguments for this pigpen run + for arg in sampleparams: + outfh.write('#' + arg + '\t' + str(sampleparams[arg]) + '\n') #total G is number of ref Gs encountered #convG is g_t + g_c (the ones we are interested in) outfh.write(('\t').join(['Gene', 'numreads'] + possibleconvs + ['totalG', 'convG', 'convGrate', 'G_Trate', 'G_Crate', 'porc']) + '\n') diff --git a/pigpen.py b/pigpen.py index 2c192f9..ac13a1e 100644 --- a/pigpen.py +++ b/pigpen.py @@ -9,6 +9,8 @@ from maskpositions import readmaskbed from getmismatches import iteratereads_pairedend, getmismatches from assignreads_salmon import getpostmasterassignments, assigntotxs, collapsetogene, readspergene, writeOutput +from assignreads import getReadOverlaps, processOverlaps +from conversionsPerGene import getPerGene, writeConvsPerGene if __name__ == '__main__': parser = argparse.ArgumentParser(description=' ,-,-----,\n PIGPEN **** \\ \\ ),)`-\'\n <`--\'> \\ \\` \n /. . `-----,\n OINC! 
> (\'\') , @~\n `-._, ___ /\n-|-|-|-|-|-|-|-| (( / (( / -|-|-| \n|-|-|-|-|-|-|-|- \'\'\' \'\'\' -|-|-|-\n-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|\n\n Pipeline for Identification \n Of Guanosine Positions\n Erroneously Notated', formatter_class = argparse.RawDescriptionHelpFormatter) @@ -20,6 +22,7 @@ parser.add_argument('--useSNPs', action = 'store_true', help = 'Consider SNPs?') parser.add_argument('--snpfile', type = str, help = 'VCF file of snps to mask. If --useSNPs but a --snpfile is not supplied, a VCF of snps will be created using --controlsamples.') parser.add_argument('--maskbed', help = 'Optional. Bed file of positions to mask from analysis.', default = None) + parser.add_argument('--ROIbed', help = 'Optional. Bed file of specific regions of interest in which to quantify conversions. If supplied, only conversions in these regions will be quantified.', default = None) parser.add_argument('--SNPcoverage', type = int, help = 'Minimum coverage to call SNPs. Default = 20', default = 20) parser.add_argument('--SNPfreq', type = float, help = 'Minimum variant frequency to call SNPs. Default = 0.2', default = 0.2) parser.add_argument('--onlyConsiderOverlap', action = 'store_true', help = 'Only consider conversions seen in both reads of a read pair?') @@ -103,44 +106,74 @@ elif not args.maskbed: maskpositions = None - #For each sample, identify conversions, assign conversions to transcripts, + #If there is no supplied bedfile of regions of interest, + #for each sample, identify conversions, assign conversions to transcripts, #and collapse transcript-level measurements to gene-level measurements. 
- for ind, sample in enumerate(samplenames): - #Create paramter dictionary that is unique to this sample - sampleparams = suppliedargs - sampleparams['sample'] = sample - - print('Running PIGPEN for {0}...'.format(sample)) - starbam = starbams[ind] - sampleparams['starbam'] = os.path.abspath(starbam) - if args.nproc == 1: - convs, readcounter = iteratereads_pairedend(starbam, args.onlyConsiderOverlap, args.use_g_t, args.use_g_c, args.use_read1, args.use_read2, args.nConv, snps, maskpositions, 'high') - elif args.nproc > 1: - convs = getmismatches(starbam, args.onlyConsiderOverlap, snps, maskpositions, args.nConv, args.nproc, args.use_g_t, args.use_g_c, args.use_read1, args.use_read2) - - print('Getting posterior probabilities from salmon alignment file...') - postmasterbam = postmasterbams[ind] - sampleparams['postmasterbam'] = os.path.abspath(postmasterbam) - pprobs = getpostmasterassignments(postmasterbam) - print('Assinging conversions to transcripts...') - txconvs = assigntotxs(pprobs, convs) - print('Collapsing transcript level conversion counts to gene level...') - tx2gene, geneid2genename, geneconvs = collapsetogene(txconvs, args.gff) - print('Counting number of reads assigned to each gene...') - salmonquant = salmonquants[ind] - sampleparams['salmonquant'] = os.path.abspath(salmonquant) - genecounts = readspergene(salmonquant, tx2gene) - print('Writing output...') - if not os.path.exists(args.outputDir): - os.mkdir(args.outputDir) - outputfile = os.path.join(args.outputDir, sample + '.pigpen.txt') - writeOutput(sampleparams, geneconvs, genecounts, geneid2genename, outputfile, args.use_g_t, args.use_g_c) - print('Done!') - - - - - - - - + if not args.ROIbed: + for ind, sample in enumerate(samplenames): + #Create parameter dictionary that is unique to this sample + sampleparams = suppliedargs + sampleparams['sample'] = sample + + print('Running PIGPEN for {0}...'.format(sample)) + starbam = starbams[ind] + sampleparams['starbam'] = os.path.abspath(starbam) + 
if args.nproc == 1: + convs, readcounter = iteratereads_pairedend(starbam, args.onlyConsiderOverlap, args.use_g_t, args.use_g_c, args.use_read1, args.use_read2, args.nConv, snps, maskpositions, 'high') + elif args.nproc > 1: + convs = getmismatches(starbam, args.onlyConsiderOverlap, snps, maskpositions, args.nConv, args.nproc, args.use_g_t, args.use_g_c, args.use_read1, args.use_read2) + + print('Getting posterior probabilities from salmon alignment file...') + postmasterbam = postmasterbams[ind] + sampleparams['postmasterbam'] = os.path.abspath(postmasterbam) + pprobs = getpostmasterassignments(postmasterbam) + print('Assigning conversions to transcripts...') + txconvs = assigntotxs(pprobs, convs) + print('Collapsing transcript level conversion counts to gene level...') + tx2gene, geneid2genename, geneconvs = collapsetogene(txconvs, args.gff) + print('Counting number of reads assigned to each gene...') + salmonquant = salmonquants[ind] + sampleparams['salmonquant'] = os.path.abspath(salmonquant) + genecounts = readspergene(salmonquant, tx2gene) + print('Writing output...') + if not os.path.exists(args.outputDir): + os.mkdir(args.outputDir) + outputfile = os.path.join(args.outputDir, sample + '.pigpen.txt') + writeOutput(sampleparams, geneconvs, genecounts, geneid2genename, outputfile, args.use_g_t, args.use_g_c) + print('Done!') + + #If there is a bed file of regions of interest supplied, then use that. Don't use the salmon/postmaster quantifications. 
+ elif args.ROIbed: + #Make fasta index + command = ['samtools', 'faidx', args.genomeFasta] + subprocess.call(command) + faidx = args.genomeFasta + '.fai' + + #Create chrsort + command = ['cut', '-f' '1,2', faidx] + with open('chrsort.txt', 'w') as outfh: + subprocess.run(command, stdout = outfh) + + for ind, sample in enumerate(samplenames): + #Create parameter dictionary that is unique to this sample + sampleparams = suppliedargs + sampleparams['sample'] = sample + + print('Running PIGPEN for {0}...'.format(sample)) + starbam = starbams[ind] + sampleparams['starbam'] = os.path.abspath(starbam) + if args.nproc == 1: + convs, readcounter = iteratereads_pairedend( + starbam, args.onlyConsiderOverlap, args.use_g_t, args.use_g_c, args.use_read1, args.use_read2, args.nConv, snps, maskpositions, 'high') + elif args.nproc > 1: + convs = getmismatches(starbam, args.onlyConsiderOverlap, snps, maskpositions, + args.nConv, args.nproc, args.use_g_t, args.use_g_c, args.use_read1, args.use_read2) + + print('Assigning reads to genes in supplied bed file...') + overlaps, numpairs = getReadOverlaps(starbam, args.ROIbed, 'chrsort.txt') + read2gene = processOverlaps(overlaps, numpairs) + numreadspergene, convsPerGene = getPerGene(convs, read2gene) + if not os.path.exists(args.outputDir): + os.mkdir(args.outputDir) + outputfile = os.path.join(args.outputDir, sample + '.pigpen.txt') + writeConvsPerGene(sampleparams, numreadspergene, convsPerGene, outputfile, args.use_g_t, args.use_g_c) \ No newline at end of file From e14a71980c512031cafe97c246b2ad98223cf02a Mon Sep 17 00:00:00 2001 From: vaethk <101118118+vaethk@users.noreply.github.com> Date: Fri, 29 Jul 2022 15:06:57 -0600 Subject: [PATCH 034/108] fixed tx verison issue --- assignreads_salmon.py | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/assignreads_salmon.py b/assignreads_salmon.py index 04c7800..05234aa 100644 --- a/assignreads_salmon.py +++ b/assignreads_salmon.py @@ -23,7 +23,7 @@ def 
getpostmasterassignments(postmasterbam): if read.is_read2: continue readid = read.query_name - tx = read.reference_name + tx = read.reference_name.split('.')[0] pprob = read.get_tag(tag='ZW') if readid not in pprobs: pprobs[readid] = {} @@ -52,6 +52,7 @@ def assigntotxs(pprobs, convs): continue for txid in pprobs[readid]: + txid = txid.split('.')[0] if txid not in txconvs: txconvs[txid] = {} pprob = pprobs[readid][txid] From 8d2bd3791041595c921b909c3822ac3058eac300 Mon Sep 17 00:00:00 2001 From: vaethk <101118118+vaethk@users.noreply.github.com> Date: Thu, 4 Aug 2022 15:44:26 -0600 Subject: [PATCH 035/108] Fixed error when reading in .txt output --- bacon_glm.py | 12 +++++++++++- 1 file changed, 11 insertions(+), 1 deletion(-) diff --git a/bacon_glm.py b/bacon_glm.py index 5b41ee6..f27e486 100644 --- a/bacon_glm.py +++ b/bacon_glm.py @@ -36,6 +36,14 @@ def readconditions(samp_conds_file): sampconddf = pd.read_csv(samp_conds_file, sep = '\t', index_col = False, header = 0) return sampconddf +def count_comments(pigpenfile): + count = 0 + with open(pigpenfile, 'r') as infh: + for line in infh: + line = line.strip() + if line.startswith('#'): + count +=1 + return count def makePORCdf(samp_conds_file, minreads): #Make a dataframe of PORC values for all samples @@ -54,7 +62,9 @@ def makePORCdf(samp_conds_file, minreads): continue pigpenfile = line[0] sample = line[1] - df = pd.read_csv(pigpenfile, sep = '\t', index_col = False, header = 0) + skip_count = count_comments(pigpenfile) + header_number = skip_count + 1 + df = pd.read_csv(pigpenfile, sep = '\t', index_col = False, header = header_number, skiprows = skip_count) dfgenes = df['Gene'].tolist() samplecolumn = [sample] * len(dfgenes) df = df.assign(sample = samplecolumn) From 6a02f764bb806a3ef6594eef8a016732a9b3301b Mon Sep 17 00:00:00 2001 From: Matthew Taliaferro Date: Fri, 26 Aug 2022 16:15:29 -0600 Subject: [PATCH 036/108] add min overlap length --- assignreads.py | 5 +++-- 1 file changed, 3 insertions(+), 2 
deletions(-) diff --git a/assignreads.py b/assignreads.py index 3021ff6..b399804 100644 --- a/assignreads.py +++ b/assignreads.py @@ -54,8 +54,9 @@ def processOverlaps(overlaps, numpairs): txs = overlaps[read] maxtx = max(txs, key = txs.get) overlaplength = txs[maxtx] #can implement minimum overlap here - gene = maxtx.split('_')[0] - read2gene[read] = gene + if overlaplength >= 225: + gene = maxtx.split('_')[0] + read2gene[read] = gene frac_readpairs_with_gene = round((len(read2gene) / numpairs) * 100, 2) print('Found genes for {0} read pairs ({1}%).'.format(len(read2gene), frac_readpairs_with_gene)) From c31f30f0c0eab2eebbcbbe2b324e71f71688f67a Mon Sep 17 00:00:00 2001 From: Matthew Taliaferro Date: Fri, 26 Aug 2022 16:16:07 -0600 Subject: [PATCH 037/108] add min mapping quality --- getmismatches.py | 8 ++++---- pigpen.py | 7 ++++--- 2 files changed, 8 insertions(+), 7 deletions(-) diff --git a/getmismatches.py b/getmismatches.py index b1290e3..6594f11 100644 --- a/getmismatches.py +++ b/getmismatches.py @@ -176,7 +176,7 @@ def findsnps(controlbams, genomefasta, minCoverage = 20, minVarFreq = 0.02): return snps -def iteratereads_pairedend(bam, onlyConsiderOverlap, use_g_t, use_g_c, use_read1, use_read2, nConv, snps=None, maskpositions=None, verbosity='high'): +def iteratereads_pairedend(bam, onlyConsiderOverlap, use_g_t, use_g_c, use_read1, use_read2, nConv, minMappingQual, snps=None, maskpositions=None, verbosity='high'): #Iterate over reads in a paired end alignment file. #Find nt conversion locations for each read. 
#For locations interrogated by both mates of read pair, conversion must exist in both mates in order to count @@ -205,7 +205,7 @@ def iteratereads_pairedend(bam, onlyConsiderOverlap, use_g_t, use_g_c, use_read1 #Check mapping quality #MapQ is 255 for uniquely aligned reads FOR STAR ONLY - if read1.mapping_quality < 255 or read2.mapping_quality < 255: + if read1.mapping_quality < minMappingQual or read2.mapping_quality < minMappingQual: continue readcounter +=1 @@ -615,7 +615,7 @@ def split_bam(bam, nproc): return splitbams -def getmismatches(bam, onlyConsiderOverlap, snps, maskpositions, nConv, nproc, use_g_t, use_g_c, use_read1, use_read2): +def getmismatches(bam, onlyConsiderOverlap, snps, maskpositions, nConv, minMappingQual, nproc, use_g_t, use_g_c, use_read1, use_read2): #Actually run the mismatch code (calling iteratereads_pairedend) #use multiprocessing #If there's only one processor, easier to use iteratereads_pairedend() directly. @@ -626,7 +626,7 @@ def getmismatches(bam, onlyConsiderOverlap, snps, maskpositions, nConv, nproc, u argslist = [] for x in splitbams: argslist.append((x, bool(onlyConsiderOverlap), bool( - use_g_t), bool(use_g_c), bool(use_read1), bool(use_read2), nConv, snps, maskpositions, 'low')) + use_g_t), bool(use_g_c), bool(use_read1), bool(use_read2), nConv, minMappingQual, snps, maskpositions, 'low')) #items returned from iteratereads_pairedend are in a list, one per process totalreadcounter = 0 #number of reads across all the split bams diff --git a/pigpen.py b/pigpen.py index ac13a1e..7a7993a 100644 --- a/pigpen.py +++ b/pigpen.py @@ -24,12 +24,13 @@ parser.add_argument('--maskbed', help = 'Optional. Bed file of positions to mask from analysis.', default = None) parser.add_argument('--ROIbed', help = 'Optional. Bed file of specific regions of interest in which to quantify conversions. 
If supplied, only conversions in these regions will be quantified.', default = None) parser.add_argument('--SNPcoverage', type = int, help = 'Minimum coverage to call SNPs. Default = 20', default = 20) - parser.add_argument('--SNPfreq', type = float, help = 'Minimum variant frequency to call SNPs. Default = 0.2', default = 0.2) + parser.add_argument('--SNPfreq', type = float, help = 'Minimum variant frequency to call SNPs. Default = 0.4', default = 0.4) parser.add_argument('--onlyConsiderOverlap', action = 'store_true', help = 'Only consider conversions seen in both reads of a read pair?') parser.add_argument('--use_g_t', action = 'store_true', help = 'Consider G->T conversions?') parser.add_argument('--use_g_c', action = 'store_true', help = 'Consider G->C conversions?') parser.add_argument('--use_read1', action = 'store_true', help = 'Use read1 when looking for conversions?') parser.add_argument('--use_read2', action = 'store_true', help = 'Use read2 when looking for conversions?') + parser.add_argument('--minMappingQual', type = int, help = 'Minimum mapping quality for a read to be considered in conversion counting.') parser.add_argument('--nConv', type = int, help = 'Minimum number of required G->T and/or G->C conversions in a read pair in order for conversions to be counted. 
Default is 1.', default = 1) parser.add_argument('--outputDir', type = str, help = 'Output directory.', required = True) args = parser.parse_args() @@ -119,9 +120,9 @@ starbam = starbams[ind] sampleparams['starbam'] = os.path.abspath(starbam) if args.nproc == 1: - convs, readcounter = iteratereads_pairedend(starbam, args.onlyConsiderOverlap, args.use_g_t, args.use_g_c, args.use_read1, args.use_read2, args.nConv, snps, maskpositions, 'high') + convs, readcounter = iteratereads_pairedend(starbam, args.onlyConsiderOverlap, args.use_g_t, args.use_g_c, args.use_read1, args.use_read2, args.nConv, args.minMappingQual, snps, maskpositions, 'high') elif args.nproc > 1: - convs = getmismatches(starbam, args.onlyConsiderOverlap, snps, maskpositions, args.nConv, args.nproc, args.use_g_t, args.use_g_c, args.use_read1, args.use_read2) + convs = getmismatches(starbam, args.onlyConsiderOverlap, snps, maskpositions, args.nConv, args.minMappingQual, args.nproc, args.use_g_t, args.use_g_c, args.use_read1, args.use_read2) print('Getting posterior probabilities from salmon alignment file...') postmasterbam = postmasterbams[ind] From 632ceb821a24db048089a8fd8bfec897bdd78bc0 Mon Sep 17 00:00:00 2001 From: Matthew Taliaferro Date: Fri, 26 Aug 2022 16:16:53 -0600 Subject: [PATCH 038/108] add alignandquant2 --- alignAndQuant2.py | 180 ++++++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 180 insertions(+) create mode 100644 alignAndQuant2.py diff --git a/alignAndQuant2.py b/alignAndQuant2.py new file mode 100644 index 0000000..b16a9de --- /dev/null +++ b/alignAndQuant2.py @@ -0,0 +1,180 @@ +import os +import subprocess +import sys +import shutil +import argparse + +#Given a pair of read files, align reads using STAR and quantify/align reads using salmon. +#This will make a STAR-produced bam (for pigpen mutation calling) and a salmon-produced bam (for read assignment). +#It will then run postmaster to append transcript assignments to the salmon-produced bam. 
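The three-stage flow these comments describe (STAR bam for mutation calling, salmon bam for read assignment, postmaster to append transcript posteriors) can be summarized in a short sketch. The output file names follow the conventions stated in the script's comments; the helper names, sample name, and the PATH check are illustrative assumptions, not part of alignAndQuant2.py:

```python
import shutil

# External executables the pipeline shells out to (hypothetical check helper).
REQUIRED_TOOLS = ['STAR', 'salmon', 'postmaster']

def missing_tools(tools=REQUIRED_TOOLS):
    """Return the subset of external executables not found on the user's PATH."""
    return [t for t in tools if shutil.which(t) is None]

def pipeline_steps(sample):
    """Order of operations described above: align, quantify, annotate.

    Each entry pairs a tool with the output file it produces for this sample,
    using the naming conventions noted in the script's comments.
    """
    return [
        ('STAR', '{0}Aligned.sortedByCoord.out.bam'.format(sample)),  # mutation-calling input
        ('salmon', '{0}.salmon.bam'.format(sample)),                  # read-assignment input
        ('postmaster', '{0}.postmaster.bam'.format(sample)),          # transcript posteriors appended
    ]

for tool, output in pipeline_steps('sampleA'):
    print(tool, '->', output)
```

Each stage consumes the previous stage's output, which is why the script runs them strictly in this order.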
+ +#This is going to take in gzipped fastqs, a directory containing the STAR index for this genome, and a directory containing the salmon index for this genome. + +#Reads are aligned to the genome using STAR. This bam file will be used for mutation calling. +#In this alignment, we allow multiple mapping reads, but only report the best alignment. + +#Reads are then quantified using salmon, where a separate transcriptome-oriented bam is written. +#Postmaster then takes this bam and adds posterior probabilities for transcript assignments. + +#When runSTAR(), bamtofastq(), runSalmon(), and runPostmaster() are run in succession, the output is a directory called . +#In this directory, the STAR output is Aligned.sortedByCoord.out.bam in STAR/, +#the salmon output is .quant.sf and .salmon.bam in salmon/, +#and the postmaster output is .postmaster.bam in postmaster/ + +#Requires STAR, salmon(>= 1.9.0), and postmaster be in user's PATH. + +def runSTAR(reads1, reads2, nthreads, STARindex, samplename): + if not os.path.exists('STAR'): + os.mkdir('STAR') + + cwd = os.getcwd() + outdir = os.path.join(cwd, 'STAR') + + #Clean output directory if it already exists + if os.path.exists(outdir) and os.path.isdir(outdir): + shutil.rmtree(outdir) + + os.mkdir(outdir) + prefix = outdir + '/' + samplename + + command = ['STAR', '--runMode', 'alignReads', '--runThreadN', nthreads, '--genomeLoad', 'NoSharedMemory', '--genomeDir', STARindex, '--readFilesIn', reads1, reads2, '--readFilesCommand', + 'zcat', '--outFileNamePrefix', prefix, '--outSAMtype', 'BAM', 'SortedByCoordinate', '--outSAMstrandField', 'intronMotif', '--outSAMmultNmax', '1', '--outSAMattributes', 'MD', 'NH'] + + print('Running STAR for {0}...'.format(samplename)) + + subprocess.run(command) + + #make index + bam = os.path.join(outdir, samplename + 'Aligned.sortedByCoord.out.bam') + bamindex = bam + '.bai' + if not os.path.exists(bamindex): + indexCMD = 'samtools index ' + bam + index = subprocess.Popen(indexCMD, shell=True) 
+ index.wait() + + print('Finished STAR for {0}!'.format(samplename)) + + +def bamtofastq(samplename, nthreads): + #Given a bam file of uniquely aligned reads (produced from runSTAR), rederive these reads as fastq in preparation for submission to salmon + #This function isn't needed anymore as we will align all reads. + if not os.path.exists('STAR'): + os.mkdir('STAR') + + cwd = os.getcwd() + outdir = os.path.join(cwd, 'STAR') + inbam = os.path.join(outdir, samplename + 'Aligned.sortedByCoord.out.bam') + sortedbam = os.path.join(outdir, 'temp.namesort.bam') + + #First sort bam file by readname + print('Sorting bam file by read name...') + command = ['samtools', 'collate', '--threads', nthreads, '-u', '-o', sortedbam, inbam] + subprocess.call(command) + print('Done!') + + #Now derive fastq + r1file = samplename + '.unique.r1.fq.gz' + r2file = samplename + '.unique.r2.fq.gz' + print('Writing fastq file of uniquely aligned reads for {0}...'.format(samplename)) + command = ['samtools', 'fastq', '--threads', nthreads, '-1', r1file, '-2', r2file, '-0', '/dev/null', '-s', '/dev/null', '-n', sortedbam] + subprocess.call(command) + print('Done writing fastq files for {0}!'.format(samplename)) + + os.remove(sortedbam) + + +def runSalmon(reads1, reads2, nthreads, salmonindex, samplename): + #Take in those uniquely aligning reads and quantify transcript abundance with them using salmon. 
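runSalmon() writes a per-sample quant.sf table, and patch 034 above strips transcript version suffixes with `.split('.')[0]`. A minimal sketch of reading such a table with the same version-stripping convention; the five-column quant.sf layout (Name, Length, EffectiveLength, TPM, NumReads, tab-separated) is standard salmon output, but the transcript IDs and values below are made up:

```python
import csv
import io

# Hypothetical quant.sf contents, mimicking salmon's tab-separated layout.
QUANT_SF = """Name\tLength\tEffectiveLength\tTPM\tNumReads
ENSMUST00000000001.4\t3262\t3080.0\t12.5\t410.0
ENSMUST00000000003.13\t902\t720.0\t0.0\t0.0
"""

def read_quant(fh):
    """Parse a salmon quant.sf table into {transcript_id: TPM},
    dropping the Ensembl version suffix as patch 034 does."""
    tpms = {}
    for row in csv.DictReader(fh, delimiter='\t'):
        txid = row['Name'].split('.')[0]  # 'ENSMUST00000000001.4' -> 'ENSMUST00000000001'
        tpms[txid] = float(row['TPM'])
    return tpms

tpms = read_quant(io.StringIO(QUANT_SF))
print(tpms)  # {'ENSMUST00000000001': 12.5, 'ENSMUST00000000003': 0.0}
```

Stripping the version here mirrors the bam-side fix: without it, versioned IDs from the index fail to match unversioned IDs elsewhere in the pipeline.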
+ + if not os.path.exists('salmon'): + os.mkdir('salmon') + + idx = os.path.abspath(salmonindex) + r1 = os.path.abspath(reads1) + r2 = os.path.abspath(reads2) + + os.chdir('salmon') + + command = ['salmon', 'quant', '--libType', 'A', '-p', nthreads, '--seqBias', '--gcBias', + '--validateMappings', '-1', r1, '-2', r2, '-o', samplename, '--index', idx, '--writeMappings={0}.salmon.bam'.format(samplename), '--writeQualities'] + + print('Running salmon for {0}...'.format(samplename)) + + subprocess.run(command) + + #Move output + outputdir = os.path.join(os.getcwd(), samplename) + quantfile = os.path.join(outputdir, 'quant.sf') + movedquantfile = os.path.join(os.getcwd(), '{0}.quant.sf'.format(samplename)) + os.rename(quantfile, movedquantfile) + + print('Finished salmon for {0}!'.format(samplename)) + +def runPostmaster(samplename, nthreads): + if not os.path.exists('postmaster'): + os.mkdir('postmaster') + + salmonquant = os.path.join(os.getcwd(), 'salmon', '{0}.quant.sf'.format(samplename)) + salmonbam = os.path.join(os.getcwd(), 'salmon', '{0}.salmon.bam'.format(samplename)) + + os.chdir('postmaster') + outputfile = os.path.join(os.getcwd(), '{0}.postmaster.bam'.format(samplename)) + + print('Running postmaster for {0}...'.format(samplename)) + command = ['postmaster', '--num-threads', nthreads, '--quant', salmonquant, '--alignments', salmonbam, '--output', outputfile] + subprocess.call(command) + + #Sort and index bam + with open(outputfile + '.sort', 'w') as sortedfh: + command = ['samtools', 'sort', '-@', nthreads, outputfile] + subprocess.run(command, stdout = sortedfh) + + os.rename(outputfile + '.sort', outputfile) + + command = ['samtools', 'index', outputfile] + subprocess.run(command) + + #We don't need the salmon alignment file anymore, and it's pretty big + os.remove(salmonbam) + + print('Finished postmaster for {0}!'.format(samplename)) + +def addMD(samplename, reffasta, nthreads): + inputbam = os.path.join(os.getcwd(), 'postmaster', 
'{0}.postmaster.bam'.format(samplename)) + command = ['samtools', 'calmd', '-b', '--threads', nthreads, inputbam, reffasta] + + print('Adding MD tags to {0}.postmaster.md.bam...'.format(samplename)) + with open(samplename + '.postmaster.md.bam', 'w') as outfile: + subprocess.run(command, stdout = outfile) + print('Finished adding MD tags to {0}.postmaster.md.bam!'.format(samplename)) + +if __name__ == '__main__': + parser = argparse.ArgumentParser(description = 'Align and quantify reads using STAR, salmon, and postmaster in preparation for analysis with PIGPEN.') + parser.add_argument('--forwardreads', type = str, help = 'Forward reads. Gzipped fastq.') + parser.add_argument('--reversereads', type = str, help = 'Reverse reads. Gzipped fastq.') + parser.add_argument('--nthreads', type = str, help = 'Number of threads to use for alignment and quantification.') + parser.add_argument('--STARindex', type = str, help = 'STAR index directory.') + parser.add_argument('--salmonindex', type = str, help = 'Salmon index directory.') + parser.add_argument('--samplename', type = str, help = 'Sample name. 
Will be appended to output files.') + args = parser.parse_args() + + r1 = os.path.abspath(args.forwardreads) + r2 = os.path.abspath(args.reversereads) + STARindex = os.path.abspath(args.STARindex) + salmonindex = os.path.abspath(args.salmonindex) + samplename = args.samplename + nthreads = args.nthreads + + wd = os.path.abspath(os.getcwd()) + sampledir = os.path.join(wd, samplename) + if os.path.exists(sampledir) and os.path.isdir(sampledir): + shutil.rmtree(sampledir) + os.mkdir(sampledir) + os.chdir(sampledir) + + runSTAR(r1, r2, nthreads, STARindex, samplename) + runSalmon(r1, r2, nthreads, salmonindex, samplename) + os.chdir(sampledir) + runPostmaster(samplename, nthreads) + + From cb6f0f51bfa769a0e1690d3b3f86abb0786d6f06 Mon Sep 17 00:00:00 2001 From: Matthew Taliaferro Date: Wed, 31 Aug 2022 10:34:45 -0600 Subject: [PATCH 039/108] add ability to handle comments in pigpen output --- bacon_glm.py | 68 +++++++++++++++++++++++++++++----------------------- 1 file changed, 38 insertions(+), 30 deletions(-) diff --git a/bacon_glm.py b/bacon_glm.py index f27e486..8f924e9 100644 --- a/bacon_glm.py +++ b/bacon_glm.py @@ -21,6 +21,7 @@ from rpy2.rinterface_lib.callbacks import logger as rpy2_logger import logging import warnings +import argparse #Need r-base, r-stats, r-lme4 @@ -36,14 +37,6 @@ def readconditions(samp_conds_file): sampconddf = pd.read_csv(samp_conds_file, sep = '\t', index_col = False, header = 0) return sampconddf -def count_comments(pigpenfile): - count = 0 - with open(pigpenfile, 'r') as infh: - for line in infh: - line = line.strip() - if line.startswith('#'): - count +=1 - return count def makePORCdf(samp_conds_file, minreads): #Make a dataframe of PORC values for all samples @@ -62,10 +55,8 @@ def makePORCdf(samp_conds_file, minreads): continue pigpenfile = line[0] sample = line[1] - skip_count = count_comments(pigpenfile) - header_number = skip_count + 1 - df = pd.read_csv(pigpenfile, sep = '\t', index_col = False, header = header_number, skiprows 
= skip_count) - dfgenes = df['Gene'].tolist() + df = pd.read_csv(pigpenfile, sep = '\t', index_col = False, comment = '#', header = 0) + dfgenes = df['GeneID'].tolist() samplecolumn = [sample] * len(dfgenes) df = df.assign(sample = samplecolumn) @@ -74,7 +65,7 @@ def makePORCdf(samp_conds_file, minreads): else: genesinall = genesinall.intersection(set(dfgenes)) - columnstokeep = ['Gene', 'sample', 'numreads', 'porc'] + columnstokeep = ['GeneID', 'GeneName', 'sample', 'numreads', 'porc'] df = df[columnstokeep] dfs.append(df) @@ -82,13 +73,13 @@ def makePORCdf(samp_conds_file, minreads): #Somehow there are some genes whose name in NA if np.nan in genesinall: genesinall.remove(np.nan) - dfs = [df.loc[df['Gene'].isin(genesinall)] for df in dfs] + dfs = [df.loc[df['GeneID'].isin(genesinall)] for df in dfs] #concatenate (rbind) dfs together df = pd.concat(dfs) #turn from long into wide - df = df.pivot_table(index = 'Gene', columns = 'sample', values = ['numreads', 'porc']).reset_index() + df = df.pivot_table(index = ['GeneID', 'GeneName'], columns = 'sample', values = ['numreads', 'porc']).reset_index() #flatten multiindex column names df.columns = ["_".join(a) if '' not in a else a[0] for a in df.columns.to_flat_index()] @@ -103,10 +94,11 @@ def makePORCdf(samp_conds_file, minreads): print('{0} genes have at least {1} reads in every sample.'.format(len(df), minreads)) #We also don't want rows with inf/-inf PORC values df = df.replace([np.inf, -np.inf], np.nan) - df = df.dropna(how= 'any') + df = df.dropna(how = 'any') #Return a dataframe of just genes and PORC values - columnstokeep = ['Gene'] + [col for col in df.columns if 'porc' in col] + columnstokeep = ['GeneID', 'GeneName'] + [col for col in df.columns if 'porc' in col] df = df[columnstokeep] + print('{0} genes pass read count filter in all files and do not have PORC values of -inf in any file.'.format(len(df))) return df @@ -251,7 +243,7 @@ def multihyp(pvalues): return correctedps -def 
getpvalues(samp_conds_file, conditionA, conditionB): +def getpvalues(samp_conds_file, conditionA, conditionB, filteredgenes): #each contingency table will be: [[convG, nonconvG], [convnonG, nonconvnonG]] #These will be stored in a dictionary: {gene : [condAtables, condBtables]} conttables = {} @@ -268,10 +260,13 @@ def getpvalues(samp_conds_file, conditionA, conditionB): pigpenfile = line[0] sample = line[1] condition = line[2] - df = pd.read_csv(pigpenfile, sep = '\t', index_col = False, header=0) + df = pd.read_csv(pigpenfile, sep = '\t', index_col = False, header=0, comment = '#') for idx, row in df.iterrows(): conttable = makeContingencyTable(row) - gene = row['Gene'] + gene = row['GeneID'] + #If this isn't one of the genes that passes read count filters in all files, skip it + if gene not in filteredgenes: + continue if gene not in conttables: conttables[gene] = [[], []] if condition == conditionA: @@ -302,11 +297,11 @@ def getpvalues(samp_conds_file, conditionA, conditionB): pdf = pd.DataFrame.from_dict(pvalues, orient = 'index', columns = ['pval']) fdrdf = pd.DataFrame.from_dict(correctedps, orient = 'index', columns = ['FDR']) - pdf = pd.merge(pdf, fdrdf, left_index = True, right_index = True).reset_index().rename(columns = {'index' : 'Gene'}) + pdf = pd.merge(pdf, fdrdf, left_index = True, right_index = True).reset_index().rename(columns = {'index' : 'GeneID'}) return pdf -def formatporcDF(porcdf): +def formatporcdf(porcdf): #Format floats in all porcDF columns formats = {'deltaPORC': '{:.3f}', 'pval': '{:.3e}', 'FDR': '{:.3e}'} c = porcdf.columns.tolist() @@ -324,15 +319,28 @@ def formatporcDF(porcdf): base = importr('base') stats = importr('stats') + parser = argparse.ArgumentParser(description = 'BACON: A framework for analyzing pigpen outputs.') + parser.add_argument('--sampconds', type=str, + help='3 column, tab delimited file. Column names must be \'file\', \'sample\', and \'condition\'. 
See README for more details.') + parser.add_argument('--minreads', type = int, help = 'Minimum read count for a gene to be considered in a sample.', default = 100) + parser.add_argument('--conditionA', type=str, + help='One of the two conditions in the \'condition\' column of sampconds. Deltaporc is defined as conditionB - conditionA.') + parser.add_argument('--conditionB', type=str, + help='One of the two conditions in the \'condition\' column of sampconds. Deltaporc is defined as conditionB - conditionA.') + parser.add_argument('--output', type = str, help = 'Output file.') + args = parser.parse_args() + #Make df of PORC values - porcdf = makePORCdf(sys.argv[1], 100) - #Add delta porc values - porcdf = calcDeltaPORC(porcdf, sys.argv[1], 'mDBF', 'pDBF') + porcdf = makePORCdf(args.sampconds, args.minreads) + #Add deltaporc values + porcdf = calcDeltaPORC(porcdf, args.sampconds, args.conditionA, args.conditionB) + filteredgenes = porcdf['GeneID'].tolist() #Get p values and corrected p values - pdf = getpvalues(sys.argv[1], 'mDBF', 'pDBF') - #Add p values and FDR - porcdf = pd.merge(porcdf, pdf, on = ['Gene']) + pdf = getpvalues(args.sampconds, args.conditionA, args.conditionB, filteredgenes) + #add p values and FDR + porcdf = pd.merge(porcdf, pdf, on = ['GeneID']) #Format floats - porcdf = formatporcDF(porcdf) + porcdf = formatporcdf(porcdf) + + porcdf.to_csv(args.output, sep = '\t', index = False) - porcdf.to_csv('porc.txt', sep='\t', index = False) From d16ffda0c2b3f52d87700dc5731fad801abbc942 Mon Sep 17 00:00:00 2001 From: Matthew Taliaferro Date: Fri, 16 Sep 2022 09:45:29 -0600 Subject: [PATCH 040/108] add minMappingQual to ROIbed mode --- pigpen.py | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/pigpen.py b/pigpen.py index 7a7993a..2b1d69b 100644 --- a/pigpen.py +++ b/pigpen.py @@ -30,7 +30,7 @@ parser.add_argument('--use_g_c', action = 'store_true', help = 'Consider G->C conversions?') parser.add_argument('--use_read1', action = 
'store_true', help = 'Use read1 when looking for conversions?') parser.add_argument('--use_read2', action = 'store_true', help = 'Use read2 when looking for conversions?') - parser.add_argument('--minMappingQual', type = int, help = 'Minimum mapping quality for a read to be considered in conversion counting.') + parser.add_argument('--minMappingQual', type = int, help = 'Minimum mapping quality for a read to be considered in conversion counting. STAR unique mappers have MAPQ 255.', required = True) parser.add_argument('--nConv', type = int, help = 'Minimum number of required G->T and/or G->C conversions in a read pair in order for conversions to be counted. Default is 1.', default = 1) parser.add_argument('--outputDir', type = str, help = 'Output directory.', required = True) args = parser.parse_args() @@ -165,10 +165,10 @@ sampleparams['starbam'] = os.path.abspath(starbam) if args.nproc == 1: convs, readcounter = iteratereads_pairedend( - starbam, args.onlyConsiderOverlap, args.use_g_t, args.use_g_c, args.use_read1, args.use_read2, args.nConv, snps, maskpositions, 'high') + starbam, args.onlyConsiderOverlap, args.use_g_t, args.use_g_c, args.use_read1, args.use_read2, args.nConv, args.minMappingQual, snps, maskpositions, 'high') elif args.nproc > 1: convs = getmismatches(starbam, args.onlyConsiderOverlap, snps, maskpositions, - args.nConv, args.nproc, args.use_g_t, args.use_g_c, args.use_read1, args.use_read2) + args.nConv, args.minMappingQual, args.nproc, args.use_g_t, args.use_g_c, args.use_read1, args.use_read2) print('Assigning reads to genes in supplied bed file...') overlaps, numpairs = getReadOverlaps(starbam, args.ROIbed, 'chrsort.txt') From b846278f5e67d9a5a6eb7ef068e6f305132d7d92 Mon Sep 17 00:00:00 2001 From: Matthew Taliaferro Date: Wed, 4 Jan 2023 14:08:03 -0700 Subject: [PATCH 041/108] change bacon to consider specific conversions --- bacon_glm.py | 80 ++++++++++++++++++++++++++++++++++++++-------------- 1 file changed, 59 insertions(+), 21 
deletions(-) diff --git a/bacon_glm.py b/bacon_glm.py index 8f924e9..66e9c7c 100644 --- a/bacon_glm.py +++ b/bacon_glm.py @@ -142,11 +142,16 @@ def calcDeltaPORC(porcdf, sampconds, conditionA, conditionB): return porcdf -def makeContingencyTable(row): +def makeContingencyTable(row, use_g_t, use_g_c): #Given a row from a pigpen df, return a contingency table of the form #[[convG, nonconvG], [convnonG, nonconvnonG]] - convG = row['g_t'] + row['g_c'] + if use_g_t and use_g_c: + convG = row['g_t'] + row['g_c'] + elif use_g_t and not use_g_c: + convG = row['g_t'] + elif use_g_c and not use_g_t: + convG = row['g_c'] nonconvG = row['g_g'] convnonG = row['a_t'] + row['a_c'] + row['a_g'] + row['c_a'] + row['c_t'] + row['c_g'] + row['t_a'] + row['t_c'] + row['t_g'] nonconvnonG = row['c_c'] + row['t_t'] + row['a_a'] @@ -170,11 +175,11 @@ def calculate_nested_f_statistic(small_model, big_model): p_value = stats.f.sf(f_stat, df_numerator, df_denom) return (f_stat, p_value) -def getgenep(geneconttable): +def getgenep(geneconttable, considernonG): #Given a gene-level contingency table of the form: #[condAtables, condBtables], where each individual sample is of the form #[[convG, nonconvG], [convnonG, nonconvnonG]], - #run glm either including or excluding condition term + #run lme either including or excluding condition term #using likelihood ratio of the two models and chisq test, return p value #Turn gene-level contingency tables into df of form @@ -202,23 +207,48 @@ def getgenep(geneconttable): d = {'cond': cond, 'nuc': nuc, 'conv': conv, 'counts': counts, 'sample' : samples} df = pd.DataFrame.from_dict(d) - #Reshape table to get individual columns for converted and nonconverted nts - df2 = df.pivot_table(index = ['cond', 'nuc', 'sample'], columns = 'conv', values = 'counts').reset_index() + + if considernonG: + #Reshape table to get individual columns for converted and nonconverted nts + df2 = df.pivot_table(index=['cond', 'nuc', 'sample'], + columns='conv', 
values='counts').reset_index() - pandas2ri.activate() + pandas2ri.activate() - fmla = 'cbind(yes, no) ~ nuc + cond + nuc:cond + (1 | sample)' - nullfmla = 'cbind(yes, no) ~ nuc + cond + (1 | sample)' + fmla = 'cbind(yes, no) ~ sample + nuc + cond + nuc:cond' + nullfmla = 'cbind(yes, no) ~ sample + nuc' - fullfit = lme4.glmer(formula=fmla, family=stats.binomial, data=df2) - reducedfit = lme4.glmer(formula=nullfmla, family=stats.binomial, data=df2) + fullfit = stats.glm(formula=fmla, family=stats.binomial, data=df2) + reducedfit = stats.glm(formula=nullfmla, family=stats.binomial, data=df2) - logratio = (stats.logLik(fullfit)[0] - stats.logLik(reducedfit)[0]) * 2 - pvalue = stats.pchisq(logratio, df=2, lower_tail=False)[0] - #format decimal - pvalue = float('{:.2e}'.format(pvalue)) - - return pvalue + logratio = (stats.logLik(fullfit)[0] - stats.logLik(reducedfit)[0]) * 2 + pvalue = stats.pchisq(logratio, df=2, lower_tail=False)[0] + #format decimal + pvalue = float('{:.2e}'.format(pvalue)) + + return pvalue + + elif not considernonG: + #Remove rows in which nuc == nonG + df = df[df.nuc == 'G'] + + #Reshape table to get individual columns for converted and nonconverted nts + df2 = df.pivot_table(index=['cond', 'nuc', 'sample'], + columns='conv', values='counts').reset_index() + + pandas2ri.activate() + fmla = 'cbind(yes, no) ~ cond' + nullfmla = 'cbind(yes, no) ~ 1' + + fullfit = stats.glm(formula=fmla, family=stats.binomial, data=df2) + reducedfit = stats.glm(formula=nullfmla, family=stats.binomial, data=df2) + + logratio = (stats.logLik(fullfit)[0] - stats.logLik(reducedfit)[0]) * 2 + pvalue = stats.pchisq(logratio, df=1, lower_tail=False)[0] + #format decimal + pvalue = float('{:.2e}'.format(pvalue)) + + return pvalue def multihyp(pvalues): #given a dictionary of {gene : pvalue}, perform multiple hypothesis correction @@ -243,7 +273,7 @@ def multihyp(pvalues): return correctedps -def getpvalues(samp_conds_file, conditionA, conditionB, filteredgenes): +def 
getpvalues(samp_conds_file, conditionA, conditionB, considernonG, filteredgenes, use_g_t, use_g_c): #each contingency table will be: [[convG, nonconvG], [convnonG, nonconvnonG]] #These will be stored in a dictionary: {gene : [condAtables, condBtables]} conttables = {} @@ -262,7 +292,7 @@ def getpvalues(samp_conds_file, conditionA, conditionB, filteredgenes): condition = line[2] df = pd.read_csv(pigpenfile, sep = '\t', index_col = False, header=0, comment = '#') for idx, row in df.iterrows(): - conttable = makeContingencyTable(row) + conttable = makeContingencyTable(row, use_g_t, use_g_c) gene = row['GeneID'] #If this isn't one of the genes that passes read count filters in all files, skip it if gene not in filteredgenes: @@ -284,7 +314,7 @@ def getpvalues(samp_conds_file, conditionA, conditionB, filteredgenes): gene_porcfiles = len(geneconttable[0]) + len(geneconttable[1]) if nsamples == gene_porcfiles: try: - p = getgenep(geneconttable) + p = getgenep(geneconttable, considernonG) except RRuntimeError: p = np.nan else: @@ -327,16 +357,24 @@ def formatporcdf(porcdf): help='One of the two conditions in the \'condition\' column of sampconds. Deltaporc is defined as conditionB - conditionA.') parser.add_argument('--conditionB', type=str, help='One of the two conditions in the \'condition\' column of sampconds. 
Deltaporc is defined as conditionB - conditionA.') + parser.add_argument('--use_g_t', help = 'Consider G to T mutations when calculating G conversion rate?', action = 'store_true') + parser.add_argument('--use_g_c', help = 'Consider G to C mutations when calculating G conversion rate?', action = 'store_true') + parser.add_argument('--considernonG', + help='Consider conversions of nonG residues to normalize for overall mutation rate?', action = 'store_true') parser.add_argument('--output', type = str, help = 'Output file.') args = parser.parse_args() + if not args.use_g_t and not args.use_g_c: + print('ERROR: we must either count G to T or G to C mutations (or both). Supply --use_g_t or --use_g_c or both.') + sys.exit() + #Make df of PORC values porcdf = makePORCdf(args.sampconds, args.minreads) #Add deltaporc values porcdf = calcDeltaPORC(porcdf, args.sampconds, args.conditionA, args.conditionB) filteredgenes = porcdf['GeneID'].tolist() #Get p values and corrected p values - pdf = getpvalues(args.sampconds, args.conditionA, args.conditionB, filteredgenes) + pdf = getpvalues(args.sampconds, args.conditionA, args.conditionB, args.considernonG, filteredgenes, args.use_g_t, args.use_g_c) #add p values and FDR porcdf = pd.merge(porcdf, pdf, on = ['GeneID']) #Format floats From a2d2aee63b2cf0fd8ed8ad98f5ae2b23000fb70b Mon Sep 17 00:00:00 2001 From: Matthew Taliaferro Date: Fri, 6 Jan 2023 14:37:58 -0700 Subject: [PATCH 042/108] in bacon make deltas specific to metrics --- bacon_glm.py | 108 ++++++++++++++++++++++++++++++++------------------- 1 file changed, 69 insertions(+), 39 deletions(-) diff --git a/bacon_glm.py b/bacon_glm.py index 66e9c7c..22e08a5 100644 --- a/bacon_glm.py +++ b/bacon_glm.py @@ -38,7 +38,7 @@ def readconditions(samp_conds_file): return sampconddf -def makePORCdf(samp_conds_file, minreads): +def makePORCdf(samp_conds_file, minreads, considernonG): #Make a dataframe of PORC values for all samples #start with GENE...SAMPLE...READCOUNT...PORC #then 
make a wide version that is GENE...SAMPLE1READCOUNT...SAMPLE1PORC...SAMPLE2READCOUNT...SAMPLE2PORC @@ -65,7 +65,7 @@ def makePORCdf(samp_conds_file, minreads): else: genesinall = genesinall.intersection(set(dfgenes)) - columnstokeep = ['GeneID', 'GeneName', 'sample', 'numreads', 'porc'] + columnstokeep = ['GeneID', 'GeneName', 'sample', 'numreads', 'G_Trate', 'G_Crate', 'convGrate', 'porc'] df = df[columnstokeep] dfs.append(df) @@ -79,7 +79,8 @@ def makePORCdf(samp_conds_file, minreads): df = pd.concat(dfs) #turn from long into wide - df = df.pivot_table(index = ['GeneID', 'GeneName'], columns = 'sample', values = ['numreads', 'porc']).reset_index() + df = df.pivot_table(index=['GeneID', 'GeneName'], columns='sample', values=[ + 'numreads', 'G_Trate', 'G_Crate', 'convGrate', 'porc']).reset_index() #flatten multiindex column names df.columns = ["_".join(a) if '' not in a else a[0] for a in df.columns.to_flat_index()] @@ -93,53 +94,70 @@ def makePORCdf(samp_conds_file, minreads): df = df.loc[df['minreadcount'] >= minreads] print('{0} genes have at least {1} reads in every sample.'.format(len(df), minreads)) #We also don't want rows with inf/-inf PORC values - df = df.replace([np.inf, -np.inf], np.nan) - df = df.dropna(how = 'any') - #Return a dataframe of just genes and PORC values - columnstokeep = ['GeneID', 'GeneName'] + [col for col in df.columns if 'porc' in col] + #This is true only if we are using porc values, otherwise we can keep them + if considernonG: + df = df.replace([np.inf, -np.inf], np.nan) + df = df.dropna(how = 'any') + #Return a dataframe of just genes and relevant values + columnstokeep = ['GeneID', 'GeneName'] + [col for col in df.columns if 'rate' in col] + [col for col in df.columns if 'porc' in col] df = df[columnstokeep] - print('{0} genes pass read count filter in all files and do not have PORC values of -inf in any file.'.format(len(df))) + print('{0} genes pass read count filter in all files.'.format(len(df))) return df -def 
calcDeltaPORC(porcdf, sampconds, conditionA, conditionB): - #Given a porc df from makePORCdf, add deltaporc values. +def calcDeltaPORC(porcdf, sampconds, conditionA, conditionB, metric): + #Given a porc df from makePORCdf, add delta metric values. + #Metric depends on whether we are considering nonG conversions (if so, metric = porc) + #and on what conversion we are considering (g_t, g_c, or if both, metric = convGrate) - deltaporcs = [] + deltametrics = [] sampconddf = readconditions(sampconds) #Get column names in porcdf that are associated with each condition conditionAsamps = sampconddf.loc[sampconddf['condition'] == conditionA] conditionAsamps = conditionAsamps['sample'].tolist() - conditionAcolumns = ['porc_' + samp for samp in conditionAsamps] + conditionAcolumns = [metric + '_' + samp for samp in conditionAsamps] conditionBsamps = sampconddf.loc[sampconddf['condition'] == conditionB] conditionBsamps = conditionBsamps['sample'].tolist() - conditionBcolumns = ['porc_' + samp for samp in conditionBsamps] + conditionBcolumns = [metric + '_' + samp for samp in conditionBsamps] print('Condition A samples: ' + (', ').join(conditionAsamps)) print('Condition B samples: ' + (', ').join(conditionBsamps)) for index, row in porcdf.iterrows(): - condAporcs = [] - condBporcs = [] + condAmetrics = [] + condBmetrics = [] for col in conditionAcolumns: - porc = row[col] - condAporcs.append(porc) + value = row[col] + condAmetrics.append(value) for col in conditionBcolumns: - porc = row[col] - condBporcs.append(porc) - - condAporcs = [x for x in condAporcs if x != np.nan] - condBporcs = [x for x in condBporcs if x != np.nan] - condAporc = np.mean(condAporcs) - condBporc = np.mean(condBporcs) - deltaporc = condBporc - condAporc - deltaporc = float(format(deltaporc, '.3f')) - deltaporcs.append(deltaporc) - - porcdf = porcdf.assign(deltaPORC = deltaporcs) - + value = row[col] + condBmetrics.append(value) + + condAmetrics = [x for x in condAmetrics if not np.isnan(x)] + condBmetrics =
[x for x in condBmetrics if not np.isnan(x)] + condAmetric = np.mean(condAmetrics) + condBmetric = np.mean(condBmetrics) + + deltametric = condBmetric - condAmetric #remember that porc is logged, but the raw conversion rates are not + if metric == 'porc': + deltametric = float(format(deltametric, '.3f')) + deltametrics.append(deltametric) + + if metric == 'porc': + porcdf = porcdf.assign(delta_porc = deltametrics) + elif metric == 'G_Trate': + porcdf = porcdf.assign(delta_G_Trate = deltametrics) + elif metric == 'G_Crate': + porcdf = porcdf.assign(delta_G_Crate = deltametrics) + elif metric == 'convGrate': + porcdf = porcdf.assign(delta_convGrate = deltametrics) + + #Only keep the columns relevant to this metric + columnstokeep = ['GeneID', 'GeneName'] + [col for col in porcdf.columns if metric in col] + porcdf = porcdf[columnstokeep] + return porcdf def makeContingencyTable(row, use_g_t, use_g_c): @@ -333,11 +351,11 @@ def getpvalues(samp_conds_file, conditionA, conditionB, considernonG, filteredge def formatporcdf(porcdf): #Format floats in all porcDF columns - formats = {'deltaPORC': '{:.3f}', 'pval': '{:.3e}', 'FDR': '{:.3e}'} - c = porcdf.columns.tolist() - c = [x for x in c if 'porc_' in x] - for x in c: - formats[x] = '{:.3f}' #all porc_SAMPLE columns + formats = {'pval': '{:.3e}', 'FDR': '{:.3e}'} + #c = porcdf.columns.tolist() + #c = [x for x in c if 'porc_' in x] + #for x in c: + # formats[x] = '{:.3f}' #all porc_SAMPLE columns for col, f in formats.items(): porcdf[col] = porcdf[col].map(lambda x: f.format(x)) @@ -364,14 +382,26 @@ parser.add_argument('--output', type = str, help = 'Output file.') args = parser.parse_args() - if not args.use_g_t and not args.use_g_c: - print('ERROR: we must either count G to T or G to C mutations (or both).
Supply --use_g_t or --use_g_c or both.') + #Considering nonG conversions uses porc values in which both g_t and g_c conversions have already been included + if not args.use_g_t and not args.use_g_c and not args.considernonG: + print('ERROR: we must either count G to T or G to C mutations (or both) or consider nonG conversions.') sys.exit() + #What metric should we care about? + if args.considernonG: + metric = 'porc' + elif args.use_g_t and not args.use_g_c: + metric = 'G_Trate' + elif args.use_g_c and not args.use_g_t: + metric = 'G_Crate' + elif args.use_g_t and args.use_g_c: + metric = 'convGrate' + + + #Make df of PORC values - porcdf = makePORCdf(args.sampconds, args.minreads) + porcdf = makePORCdf(args.sampconds, args.minreads, args.considernonG) #Add deltaporc values - porcdf = calcDeltaPORC(porcdf, args.sampconds, args.conditionA, args.conditionB) + porcdf = calcDeltaPORC(porcdf, args.sampconds, args.conditionA, args.conditionB, metric) filteredgenes = porcdf['GeneID'].tolist() #Get p values and corrected p values pdf = getpvalues(args.sampconds, args.conditionA, args.conditionB, args.considernonG, filteredgenes, args.use_g_t, args.use_g_c) From be5c60767819811eff7a902c3afa7ed7cc4afb90 Mon Sep 17 00:00:00 2001 From: Matthew Taliaferro Date: Fri, 6 Jan 2023 14:54:00 -0700 Subject: [PATCH 043/108] minor update to bacon for float formatting --- bacon_glm.py | 2 ++ 1 file changed, 2 insertions(+) diff --git a/bacon_glm.py b/bacon_glm.py index 22e08a5..033a522 100644 --- a/bacon_glm.py +++ b/bacon_glm.py @@ -143,6 +143,8 @@ def calcDeltaPORC(porcdf, sampconds, conditionA, conditionB, metric): deltametric = condBmetric - condAmetric #remember that porc is logged, but the raw conversion rates are not if metric == 'porc': deltametric = float(format(deltametric, '.3f')) + else: + deltametric = '{:.3e}'.format(deltametric) deltametrics.append(deltametric) if metric == 'porc': From 6d6abe5615b223bd797124135ec559b4fc80a947 Mon Sep 17 00:00:00 2001 From: Matthew
Taliaferro Date: Thu, 26 Jan 2023 08:54:18 -0700 Subject: [PATCH 044/108] update alignandquant scripts --- alignAndQuant2.py | 3 ++- assignreads_salmon.py | 3 +++ 2 files changed, 5 insertions(+), 1 deletion(-) diff --git a/alignAndQuant2.py b/alignAndQuant2.py index b16a9de..5fca1cb 100644 --- a/alignAndQuant2.py +++ b/alignAndQuant2.py @@ -16,6 +16,8 @@ #Reads are then quantified using salmon, where a separate transcriptome-oriented bam is written. #Postmaster then takes this bam and adds posterior probabilities for transcript assignments. +#alignAndQuant.py only gives uniquely aligned reads to salmon. alignAndQuant2.py gives all reads to salmon. + #When runSTAR(), bamtofastq(), runSalmon(), and runPostmaster() are run in succession, the output is a directory called . #In this directory, the STAR output is Aligned.sortedByCoord.out.bam in STAR/, #the salmon output is .quant.sf and .salmon.bam in salmon/, @@ -84,7 +86,6 @@ def bamtofastq(samplename, nthreads): def runSalmon(reads1, reads2, nthreads, salmonindex, samplename): - #Take in those uniquely aligning reads and quantify transcript abundance with them using salmon. if not os.path.exists('salmon'): os.mkdir('salmon') diff --git a/assignreads_salmon.py b/assignreads_salmon.py index 05234aa..65f5c69 100644 --- a/assignreads_salmon.py +++ b/assignreads_salmon.py @@ -40,6 +40,7 @@ def assigntotxs(pprobs, convs): #convs = #{readid : {a_a : 200, a_t : 1, etc.}} print('Finding transcript assignments for {0} reads.'.format(len(convs))) readswithoutassignment = 0 #number of reads which exist in convs but not in pprobs (i.e. 
weren't assigned to a transcript by salmon) + assignedreads = 0 #number of reads in convs for which we found a match in pprobs txconvs = {} # {txid : {a_a : 200, a_t : 1, etc.}} @@ -47,6 +48,7 @@ def assigntotxs(pprobs, convs): try: readconvs = convs[readid] + assignedreads +=1 except KeyError: #we couldn't find this read in convs readswithoutassignment +=1 continue @@ -64,6 +66,7 @@ def assigntotxs(pprobs, convs): txconvs[txid][conv] += scaledconv readswithtxs = len(convs) - readswithoutassignment + readswithtxs = assignedreads pct = round(readswithtxs / len(convs), 2) * 100 print('Found transcripts for {0} of {1} reads ({2}%).'.format(readswithtxs, len(convs), pct)) From 5525fdf3b5cfd33f229a62e11bf700d895740e82 Mon Sep 17 00:00:00 2001 From: goeringr Date: Mon, 6 Feb 2023 12:07:21 -0700 Subject: [PATCH 045/108] dedupUMI Including option to deduplicate UMIs as a flag in both AlignUMIquant and Pigpen. This requires reads to have UMIs appended to the read header by UMI_tools extract --- alignUMIquant.py | 230 +++++++++++++++++++++++++++++++++++++++++++++++ pigpen.py | 60 +++++++++---- 2 files changed, 272 insertions(+), 18 deletions(-) create mode 100644 alignUMIquant.py diff --git a/alignUMIquant.py b/alignUMIquant.py new file mode 100644 index 0000000..567bc70 --- /dev/null +++ b/alignUMIquant.py @@ -0,0 +1,230 @@ +import os +import subprocess +import sys +import shutil +import argparse +''' +Given a pair of read files, align reads using STAR, deduplicate reads by UMI, and quantify reads using salmon. +This will make a STAR-produced bam (for pigpen mutation calling) +This bam will be deduplicated with UMI-tools then passed to salmon(for read assignment). +It will then run postmaster to append transcript assignments to the salmon-produced bam. + +This is going to take in gzipped fastqs with UMIs extracted, +a directory containing the STAR index for this genome, and a directory containing the salmon index for this genome.
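The commit message above requires that UMIs already be appended to the read header by `umi_tools extract`. That tool's default behavior is to append the UMI to the read identifier after a final underscore; a minimal sketch of recovering the UMI from such a read name (the read id below is a hypothetical example, not taken from the repository):

```python
def split_umi(read_name):
    # umi_tools extract appends the UMI after a final underscore
    # (its default separator); split it back out of the read name.
    base, sep, umi = read_name.rpartition('_')
    if not sep:
        raise ValueError('no UMI separator in read name: ' + read_name)
    return base, umi

# hypothetical Illumina read id carrying an 8 nt UMI
name = 'NB501673:136:H3LVKBGXC:1:11101:9735:1049_ATGCGTAA'
print(split_umi(name))
```

This is only the default naming scheme; a non-default `--umi-separator` would require the same change on the dedup side.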
+ +Reads are aligned to the genome using STAR. This bam file will be used for mutation calling. +In this alignment, we allow multiple mapping reads, but only report the best alignment. +This bam will then be deduplicated based on UMI and alignment position. + +Reads are then quantified using salmon, where a separate transcriptome-oriented bam is written. +Postmaster then takes this bam and adds posterior probabilities for transcript assignments. + +When runSTAR(), runSalmon(), and runPostmaster() are run in succession, the output is a directory called . +In this directory, the STAR output is .Aligned.sortedByCoord.out.bam in STAR/, +the salmon output is .quant.sf and .salmon.bam in salmon/, +and the postmaster output is .postmaster.bam in postmaster/ + +Requires STAR, salmon (>= 1.9.0), and postmaster to be in the user's PATH. +''' + +def runSTAR(reads1, reads2, nthreads, STARindex, samplename): + if not os.path.exists('STAR'): + os.mkdir('STAR') + + cwd = os.getcwd() + outdir = os.path.join(cwd, 'STAR') + + #Clean output directory if it already exists + if os.path.exists(outdir) and os.path.isdir(outdir): + shutil.rmtree(outdir) + + os.mkdir(outdir) + prefix = outdir + '/' + samplename + + command = ['STAR', '--runMode', 'alignReads', '--runThreadN', nthreads, '--genomeLoad', 'NoSharedMemory', '--genomeDir', STARindex, '--readFilesIn', reads1, reads2, '--readFilesCommand', + 'zcat', '--outFileNamePrefix', prefix, '--outSAMtype', 'BAM', 'SortedByCoordinate', '--outSAMstrandField', 'intronMotif', '--outSAMmultNmax', '1', '--outSAMattributes', 'MD', 'NH'] + + print('Running STAR for {0}...'.format(samplename)) + + subprocess.run(command) + + #make index + bam = os.path.join(outdir, samplename + 'Aligned.sortedByCoord.out.bam') + bamindex = bam + '.bai' + if not os.path.exists(bamindex): + indexCMD = 'samtools index ' + bam + index = subprocess.Popen(indexCMD, shell=True) + index.wait() + + print('Finished STAR for {0}!'.format(samplename)) + + +def runDedup(samplename,
nthreads): + STARbam = os.path.join(os.getcwd(), 'STAR', '{0}Aligned.sortedByCoord.out.bam'.format(samplename)) + dedupbam = os.path.join(os.getcwd(), 'STAR', '{0}.dedup.bam'.format(samplename)) + command = ['umi_tools', 'dedup', '-I', STARbam, '--paired', '--output-stats=deduplicated', '-S', dedupbam] + + print('Running deduplication for {0}...'.format(samplename)) + + subprocess.run(command) + + command = ['samtools', 'index', dedupbam] + subprocess.run(command) + + #We don't need the STAR alignment file anymore, and it's pretty big + # os.remove(STARbam) + + print('Finished deduplicating {0}!'.format(samplename)) + + +def bamtofastq(samplename, nthreads, dedup): + #Given a bam file of uniquely aligned reads (produced from runSTAR), rederive these reads as fastq in preparation for submission to salmon + if not os.path.exists('STAR'): + os.mkdir('STAR') + + cwd = os.getcwd() + outdir = os.path.join(cwd, 'STAR') + if dedup: + inbam = os.path.join(outdir, samplename + '.dedup.bam') + else: + inbam = os.path.join(outdir, samplename + 'Aligned.sortedByCoord.out.bam') + sortedbam = os.path.join(outdir, 'temp.namesort.bam') + + #First sort bam file by readname + print('Sorting bam file by read name...') + command = ['samtools', 'collate', '--threads', nthreads, '-u', '-o', sortedbam, inbam] + subprocess.call(command) + print('Done!') + + #Now derive fastq + if dedup: + r1file = samplename + '.dedup.r1.fq.gz' + r2file = samplename + '.dedup.r2.fq.gz' + else: + r1file = samplename + '.STARaligned.r1.fq.gz' + r2file = samplename + '.STARaligned.r2.fq.gz' + print('Writing fastq file of deduplicated reads for {0}...'.format(samplename)) + command = ['samtools', 'fastq', '--threads', nthreads, '-1', r1file, '-2', r2file, '-0', '/dev/null', '-s', '/dev/null', '-n', sortedbam] + subprocess.call(command) + print('Done writing fastq files for {0}!'.format(samplename)) + + os.remove(sortedbam) + + +def runSalmon(reads1, reads2, nthreads, salmonindex, samplename): + #Take in those
deduplicated reads and quantify transcript abundance with them using salmon. + + if not os.path.exists('salmon'): + os.mkdir('salmon') + + idx = os.path.abspath(salmonindex) + r1 = os.path.abspath(reads1) + r2 = os.path.abspath(reads2) + + os.chdir('salmon') + + command = ['salmon', 'quant', '--libType', 'A', '-p', nthreads, '--seqBias', '--gcBias', + '--validateMappings', '-1', r1, '-2', r2, '-o', samplename, '--index', idx, '--writeMappings={0}.salmon.bam'.format(samplename), '--writeQualities'] + + print('Running salmon for {0}...'.format(samplename)) + + subprocess.call(command) + + #Move output + outputdir = os.path.join(os.getcwd(), samplename) + quantfile = os.path.join(outputdir, 'quant.sf') + movedquantfile = os.path.join(os.getcwd(), '{0}.quant.sf'.format(samplename)) + os.rename(quantfile, movedquantfile) + + #Remove uniquely aligning read files + os.remove(r1) + os.remove(r2) + + print('Finished salmon for {0}!'.format(samplename)) + + + + +def runPostmaster(samplename, nthreads): + if not os.path.exists('postmaster'): + os.mkdir('postmaster') + + salmonquant = os.path.join(os.getcwd(), 'salmon', '{0}.quant.sf'.format(samplename)) + salmonbam = os.path.join(os.getcwd(), 'salmon', '{0}.salmon.bam'.format(samplename)) + + os.chdir('postmaster') + outputfile = os.path.join(os.getcwd(), '{0}.postmaster.bam'.format(samplename)) + + print('Running postmaster for {0}...'.format(samplename)) + command = ['postmaster', '--num-threads', nthreads, '--quant', salmonquant, '--alignments', salmonbam, '--output', outputfile] + subprocess.call(command) + + #Sort and index bam + with open(outputfile + '.sort', 'w') as sortedfh: + command = ['samtools', 'sort', '-@', nthreads, outputfile] + subprocess.run(command, stdout = sortedfh) + + os.rename(outputfile + '.sort', outputfile) + + command = ['samtools', 'index', outputfile] + subprocess.run(command) + + #We don't need the salmon alignment file anymore, and it's pretty big + os.remove(salmonbam) + + print('Finished 
postmaster for {0}!'.format(samplename)) + + +def addMD(samplename, reffasta, nthreads): + inputbam = os.path.join(os.getcwd(), 'postmaster', '{0}.postmaster.bam'.format(samplename)) + command = ['samtools', 'calmd', '-b', '--threads', nthreads, inputbam, reffasta] + + print('Adding MD tags to {0}.postmaster.md.bam...'.format(samplename)) + with open(samplename + '.postmaster.md.bam', 'w') as outfile: + subprocess.run(command, stdout = outfile) + print('Finished adding MD tags to {0}.postmaster.md.bam!'.format(samplename)) + +if __name__ == '__main__': + parser = argparse.ArgumentParser(description = 'Align and quantify reads using STAR, salmon, and postmaster in preparation for analysis with PIGPEN.') + parser.add_argument('--forwardreads', type = str, help = 'Forward reads. Gzipped fastq.') + parser.add_argument('--reversereads', type = str, help = 'Reverse reads. Gzipped fastq.') + parser.add_argument('--nthreads', type = str, help = 'Number of threads to use for alignment and quantification.') + parser.add_argument('--STARindex', type = str, help = 'STAR index directory.') + parser.add_argument('--salmonindex', type = str, help = 'Salmon index directory.') + parser.add_argument('--samplename', type = str, help = 'Sample name. Will be appended to output files.') + parser.add_argument('--dedupUMI', action = 'store_true', help = 'Deduplicate UMIs? 
requires UMI extract.') + args = parser.parse_args() + + r1 = os.path.abspath(args.forwardreads) + r2 = os.path.abspath(args.reversereads) + STARindex = os.path.abspath(args.STARindex) + salmonindex = os.path.abspath(args.salmonindex) + samplename = args.samplename + nthreads = args.nthreads + + wd = os.path.abspath(os.getcwd()) + sampledir = os.path.join(wd, samplename) + if os.path.exists(sampledir) and os.path.isdir(sampledir): + shutil.rmtree(sampledir) + os.mkdir(sampledir) + os.chdir(sampledir) + + #uniquely aligning read files + if args.dedupUMI: + salmonR1 = samplename + '.dedup.r1.fq.gz' + salmonR2 = samplename + '.dedup.r2.fq.gz' + else: + salmonR1 = samplename + '.STARaligned.r1.fq.gz' + salmonR2 = samplename + '.STARaligned.r2.fq.gz' + + runSTAR(r1, r2, nthreads, STARindex, samplename) + if args.dedupUMI: + runDedup(samplename, nthreads) + bamtofastq(samplename, nthreads, args.dedupUMI) + runSalmon(salmonR1, salmonR2, nthreads, salmonindex, samplename) + os.chdir(sampledir) + runPostmaster(samplename, nthreads) + + + + diff --git a/pigpen.py b/pigpen.py index 2b1d69b..34fe04a 100644 --- a/pigpen.py +++ b/pigpen.py @@ -33,6 +33,7 @@ parser.add_argument('--minMappingQual', type = int, help = 'Minimum mapping quality for a read to be considered in conversion counting. STAR unique mappers have MAPQ 255.', required = True) parser.add_argument('--nConv', type = int, help = 'Minimum number of required G->T and/or G->C conversions in a read pair in order for conversions to be counted. Default is 1.', default = 1) parser.add_argument('--outputDir', type = str, help = 'Output directory.', required = True) + parser.add_argument('--dedupUMI', action = 'store_true', help = 'Deduplicate UMIs? 
requires UMI extract.') args = parser.parse_args() #Store command line arguments @@ -46,6 +47,7 @@ samplenames = args.samplenames.split(',') salmonquants = [os.path.join(x, 'salmon', '{0}.quant.sf'.format(x)) for x in samplenames] starbams = [os.path.join(x, 'STAR', '{0}Aligned.sortedByCoord.out.bam'.format(x)) for x in samplenames] + dedupbams = [os.path.join(x, 'STAR', '{0}.dedup.bam'.format(x)) for x in samplenames] postmasterbams = [os.path.join(x, 'postmaster', '{0}.postmaster.bam'.format(x)) for x in samplenames] #Take in list of control samples, make list of their corresponding star bams for SNP calling @@ -56,9 +58,13 @@ if x in controlsamples: controlindicies.append(ind) - controlstarbams = [] - for x in controlindicies: - controlstarbams.append(starbams[x]) + controlsamplebams = [] + if args.dedupUMI: + for x in controlindicies: + controlsamplebams.append(dedupbams[x]) + else: + for x in controlindicies: + controlsamplebams.append(starbams[x]) #We have to be either looking for G->T or G->C, if not both if not args.use_g_t and not args.use_g_c: @@ -84,7 +90,7 @@ if args.useSNPs and not args.snpfile: if not os.path.exists('snps'): os.mkdir('snps') - vcfFileNames = getSNPs(controlstarbams, args.genomeFasta, args.SNPcoverage, args.SNPfreq) + vcfFileNames = getSNPs(controlsamplebams, args.genomeFasta, args.SNPcoverage, args.SNPfreq) for f in vcfFileNames: csi = f + '.csi' log = f[:-3] + '.log' @@ -117,12 +123,20 @@ sampleparams['sample'] = sample print('Running PIGPEN for {0}...'.format(sample)) - starbam = starbams[ind] - sampleparams['starbam'] = os.path.abspath(starbam) - if args.nproc == 1: - convs, readcounter = iteratereads_pairedend(starbam, args.onlyConsiderOverlap, args.use_g_t, args.use_g_c, args.use_read1, args.use_read2, args.nConv, args.minMappingQual, snps, maskpositions, 'high') - elif args.nproc > 1: - convs = getmismatches(starbam, args.onlyConsiderOverlap, snps, maskpositions, args.nConv, args.minMappingQual, args.nproc, args.use_g_t, 
args.use_g_c, args.use_read1, args.use_read2) + if args.dedupUMI: + dedupbam = dedupbams[ind] + sampleparams['dedupbam'] = os.path.abspath(dedupbam) + if args.nproc == 1: + convs, readcounter = iteratereads_pairedend(dedupbam, args.onlyConsiderOverlap, args.use_g_t, args.use_g_c, args.use_read1, args.use_read2, args.nConv, args.minMappingQual, snps, maskpositions, 'high') + elif args.nproc > 1: + convs = getmismatches(dedupbam, args.onlyConsiderOverlap, snps, maskpositions, args.nConv, args.minMappingQual, args.nproc, args.use_g_t, args.use_g_c, args.use_read1, args.use_read2) + else: + starbam = starbams[ind] + sampleparams['starbam'] = os.path.abspath(starbam) + if args.nproc == 1: + convs, readcounter = iteratereads_pairedend(starbam, args.onlyConsiderOverlap, args.use_g_t, args.use_g_c, args.use_read1, args.use_read2, args.nConv, args.minMappingQual, snps, maskpositions, 'high') + elif args.nproc > 1: + convs = getmismatches(starbam, args.onlyConsiderOverlap, snps, maskpositions, args.nConv, args.minMappingQual, args.nproc, args.use_g_t, args.use_g_c, args.use_read1, args.use_read2) print('Getting posterior probabilities from salmon alignment file...') postmasterbam = postmasterbams[ind] @@ -161,14 +175,24 @@ sampleparams['sample'] = sample print('Running PIGPEN for {0}...'.format(sample)) - starbam = starbams[ind] - sampleparams['starbam'] = os.path.abspath(starbam) - if args.nproc == 1: - convs, readcounter = iteratereads_pairedend( - starbam, args.onlyConsiderOverlap, args.use_g_t, args.use_g_c, args.use_read1, args.use_read2, args.nConv, args.minMappingQual, snps, maskpositions, 'high') - elif args.nproc > 1: - convs = getmismatches(starbam, args.onlyConsiderOverlap, snps, maskpositions, - args.nConv, args.minMappingQual, args.nproc, args.use_g_t, args.use_g_c, args.use_read1, args.use_read2) + if args.dedupUMI: + dedupbam = dedupbams[ind] + sampleparams['dedupbam'] = os.path.abspath(dedupbam) + if args.nproc == 1: + convs, readcounter =
iteratereads_pairedend( + dedupbam, args.onlyConsiderOverlap, args.use_g_t, args.use_g_c, args.use_read1, args.use_read2, args.nConv, args.minMappingQual, snps, maskpositions, 'high') + elif args.nproc > 1: + convs = getmismatches(dedupbam, args.onlyConsiderOverlap, snps, maskpositions, + args.nConv, args.minMappingQual, args.nproc, args.use_g_t, args.use_g_c, args.use_read1, args.use_read2) + else: + starbam = starbams[ind] + sampleparams['starbam'] = os.path.abspath(starbam) + if args.nproc == 1: + convs, readcounter = iteratereads_pairedend( + starbam, args.onlyConsiderOverlap, args.use_g_t, args.use_g_c, args.use_read1, args.use_read2, args.nConv, args.minMappingQual, snps, maskpositions, 'high') + elif args.nproc > 1: + convs = getmismatches(starbam, args.onlyConsiderOverlap, snps, maskpositions, + args.nConv, args.minMappingQual, args.nproc, args.use_g_t, args.use_g_c, args.use_read1, args.use_read2) print('Assigning reads to genes in supplied bed file...') overlaps, numpairs = getReadOverlaps(starbam, args.ROIbed, 'chrsort.txt') From 841373c8eea2aded9ef09de7dc59c1e230e13c10 Mon Sep 17 00:00:00 2001 From: goeringr Date: Mon, 6 Feb 2023 12:15:07 -0700 Subject: [PATCH 046/108] Revert "dedupUMI" This reverts commit 5525fdf3b5cfd33f229a62e11bf700d895740e82. --- alignUMIquant.py | 230 ----------------------------------------------- pigpen.py | 60 ++++--------- 2 files changed, 18 insertions(+), 272 deletions(-) delete mode 100644 alignUMIquant.py diff --git a/alignUMIquant.py b/alignUMIquant.py deleted file mode 100644 index 567bc70..0000000 --- a/alignUMIquant.py +++ /dev/null @@ -1,230 +0,0 @@ -import os -import subprocess -import sys -import shutil -import argparse -''' -Given a pair of read files, align reads using STAR, deduplicate reads by UMI, and quantify reads using salmon. -This will make a STAR-produced bam (for pigpen mutation calling) -This bam will be deduplicated with UMI-tools then passed to salmon(for read assignment). 
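The per-sample bam selection that patch 045 adds to pigpen.py (and patch 046, below, reverts) can be sketched on its own. This is a paraphrase, not repository code; the helper name `sample_bams` is mine, and the paths follow the `<sample>/STAR/` layout used throughout the diff:

```python
import os

def sample_bams(samplenames, dedup_umi):
    # Mirror of the (later reverted) pigpen.py logic: use the
    # UMI-deduplicated bam when --dedupUMI is set, otherwise the
    # coordinate-sorted STAR bam, for every sample.
    if dedup_umi:
        return [os.path.join(s, 'STAR', '{0}.dedup.bam'.format(s)) for s in samplenames]
    else:
        return [os.path.join(s, 'STAR', '{0}Aligned.sortedByCoord.out.bam'.format(s)) for s in samplenames]
```

Centralizing the choice in one helper would also have avoided the copy-paste divergence between the two pigpen.py hunks in patch 045.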
-It will then run postmaster to append transcript assignments to the salmon-produced bam. - -This is going to take in gzipped fastqs with UMIs extracted, -a directory containing the STAR index for this genome, and a directory containing the salmon index for this genome. - -Reads are aligned to the genome using STAR. This bam file will be used for mutation calling. -In this alignment, we allow multiple mapping reads, but only report the best alignment. -This bam will then be deduplicated based on UMI and alignment position. - -Reads are then quantified using salmon, where a separate transcriptome-oriented bam is written. -Postmaster then takes this bam and adds posterior probabilities for transcript assignments. - -When runSTAR(), runSalmon(), and runPostmaster() are run in succession, the output is a directory called . -In this directory, the STAR output is .Aligned.sortedByCoord.out.bam in STAR/, -the salmon output is .quant.sf and .salmon.bam in salmon/, -and the postmaster output is .postmaster.bam in postmaster/ - -Requires STAR, salmon (>= 1.9.0), and postmaster to be in the user's PATH.
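The docstring above fixes a per-sample output layout (STAR/, salmon/, and postmaster/ subdirectories inside the sample directory). A small sanity helper, built only from the file names the docstring lists (the function name `expected_outputs` is mine, not part of the pipeline):

```python
import os

def expected_outputs(samplename):
    # Output files produced by runSTAR(), runSalmon(), and runPostmaster(),
    # per the docstring's directory layout.
    return {
        'star_bam': os.path.join(samplename, 'STAR', samplename + 'Aligned.sortedByCoord.out.bam'),
        'salmon_quant': os.path.join(samplename, 'salmon', samplename + '.quant.sf'),
        'salmon_bam': os.path.join(samplename, 'salmon', samplename + '.salmon.bam'),
        'postmaster_bam': os.path.join(samplename, 'postmaster', samplename + '.postmaster.bam'),
    }
```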
-''' - -def runSTAR(reads1, reads2, nthreads, STARindex, samplename): - if not os.path.exists('STAR'): - os.mkdir('STAR') - - cwd = os.getcwd() - outdir = os.path.join(cwd, 'STAR') - - #Clean output directory if it already exists - if os.path.exists(outdir) and os.path.isdir(outdir): - shutil.rmtree(outdir) - - os.mkdir(outdir) - prefix = outdir + '/' + samplename - - command = ['STAR', '--runMode', 'alignReads', '--runThreadN', nthreads, '--genomeLoad', 'NoSharedMemory', '--genomeDir', STARindex, '--readFilesIn', reads1, reads2, '--readFilesCommand', - 'zcat', '--outFileNamePrefix', prefix, '--outSAMtype', 'BAM', 'SortedByCoordinate', '--outSAMstrandField', 'intronMotif', '--outSAMmultNmax', '1', '--outSAMattributes', 'MD', 'NH'] - - print('Running STAR for {0}...'.format(samplename)) - - subprocess.run(command) - - #make index - bam = os.path.join(outdir, samplename + 'Aligned.sortedByCoord.out.bam') - bamindex = bam + '.bai' - if not os.path.exists(bamindex): - indexCMD = 'samtools index ' + bam - index = subprocess.Popen(indexCMD, shell=True) - index.wait() - - print('Finished STAR for {0}!'.format(samplename)) - - -def runDedup(samplename, nthreads): - STARbam = os.path.join(os.getcwd(), 'STAR', '{0}Aligned.sortedByCoord.out.bam'.format(samplename)) - dedupbam = os.path.join(os.getcwd(), 'STAR', '{0}.dedup.bam'.format(samplename)) - command = ['umi_tools', 'dedup', '-I', STARbam, '--paired', '--output-stats=deduplicated', '-S', dedupbam] - - print('Running deduplication for {0}...'.format(samplename)) - - subprocess.run(command) - - command = ['samtools', 'index', dedupbam] - subprocess.run(command) - - #We don't need the STAR alignment file anymore, and it's pretty big - # os.remove(STARbam) - - print('Finished deduplicating {0}!'.format(samplename)) - - -def bamtofastq(samplename, nthreads, dedup): - #Given a bam file of uniquely aligned reads (produced from runSTAR), rederive these reads as fastq in preparation for submission to salmon - if not
os.path.exists('STAR'): - os.mkdir('STAR') - - cwd = os.getcwd() - outdir = os.path.join(cwd, 'STAR') - if dedup: - inbam = os.path.join(outdir, samplename + '.dedup.bam') - else: - inbam = os.path.join(outdir, samplename + 'Aligned.sortedByCoord.out.bam') - sortedbam = os.path.join(outdir, 'temp.namesort.bam') - - #First sort bam file by readname - print('Sorting bam file by read name...') - command = ['samtools', 'collate', '--threads', nthreads, '-u', '-o', sortedbam, inbam] - subprocess.call(command) - print('Done!') - - #Now derive fastq - if dedup: - r1file = samplename + '.dedup.r1.fq.gz' - r2file = samplename + '.dedup.r2.fq.gz' - else: - r1file = samplename + '.STARaligned.r1.fq.gz' - r2file = samplename + '.STARaligned.r2.fq.gz' - print('Writing fastq file of deduplicated reads for {0}...'.format(samplename)) - command = ['samtools', 'fastq', '--threads', nthreads, '-1', r1file, '-2', r2file, '-0', '/dev/null', '-s', '/dev/null', '-n', sortedbam] - subprocess.call(command) - print('Done writing fastq files for {0}!'.format(samplename)) - - os.remove(sortedbam) - - -def runSalmon(reads1, reads2, nthreads, salmonindex, samplename): - #Take in those deduplicated reads and quantify transcript abundance with them using salmon. 
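bamtofastq() above first name-collates the bam (so mates become adjacent) and then rederives paired gzipped fastqs with `samtools fastq`. The two invocations it builds can be reconstructed as plain argument lists; this is a sketch mirroring the deleted code (with `nthreads` passed as a string, as in the script), and the helper name `fastq_commands` is mine:

```python
def fastq_commands(samplename, nthreads, dedup):
    # Rebuild the two commands from bamtofastq(): `samtools collate` makes
    # mates adjacent by read name, then `samtools fastq` splits them into
    # r1/r2 gzipped fastqs; unpaired and singleton reads go to /dev/null.
    suffix = '.dedup' if dedup else '.STARaligned'
    bamname = samplename + ('.dedup.bam' if dedup else 'Aligned.sortedByCoord.out.bam')
    inbam = 'STAR/' + bamname
    sortedbam = 'STAR/temp.namesort.bam'
    collate = ['samtools', 'collate', '--threads', nthreads, '-u', '-o', sortedbam, inbam]
    fastq = ['samtools', 'fastq', '--threads', nthreads,
             '-1', samplename + suffix + '.r1.fq.gz',
             '-2', samplename + suffix + '.r2.fq.gz',
             '-0', '/dev/null', '-s', '/dev/null', '-n', sortedbam]
    return collate, fastq
```

Collate (rather than a full name sort) is sufficient here because `samtools fastq` only needs mates grouped, not globally ordered.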
- - if not os.path.exists('salmon'): - os.mkdir('salmon') - - idx = os.path.abspath(salmonindex) - r1 = os.path.abspath(reads1) - r2 = os.path.abspath(reads2) - - os.chdir('salmon') - - command = ['salmon', 'quant', '--libType', 'A', '-p', nthreads, '--seqBias', '--gcBias', - '--validateMappings', '-1', r1, '-2', r2, '-o', samplename, '--index', idx, '--writeMappings={0}.salmon.bam'.format(samplename), '--writeQualities'] - - print('Running salmon for {0}...'.format(samplename)) - - subprocess.call(command) - - #Move output - outputdir = os.path.join(os.getcwd(), samplename) - quantfile = os.path.join(outputdir, 'quant.sf') - movedquantfile = os.path.join(os.getcwd(), '{0}.quant.sf'.format(samplename)) - os.rename(quantfile, movedquantfile) - - #Remove uniquely aligning read files - os.remove(r1) - os.remove(r2) - - print('Finished salmon for {0}!'.format(samplename)) - - - - -def runPostmaster(samplename, nthreads): - if not os.path.exists('postmaster'): - os.mkdir('postmaster') - - salmonquant = os.path.join(os.getcwd(), 'salmon', '{0}.quant.sf'.format(samplename)) - salmonbam = os.path.join(os.getcwd(), 'salmon', '{0}.salmon.bam'.format(samplename)) - - os.chdir('postmaster') - outputfile = os.path.join(os.getcwd(), '{0}.postmaster.bam'.format(samplename)) - - print('Running postmaster for {0}...'.format(samplename)) - command = ['postmaster', '--num-threads', nthreads, '--quant', salmonquant, '--alignments', salmonbam, '--output', outputfile] - subprocess.call(command) - - #Sort and index bam - with open(outputfile + '.sort', 'w') as sortedfh: - command = ['samtools', 'sort', '-@', nthreads, outputfile] - subprocess.run(command, stdout = sortedfh) - - os.rename(outputfile + '.sort', outputfile) - - command = ['samtools', 'index', outputfile] - subprocess.run(command) - - #We don't need the salmon alignment file anymore, and it's pretty big - os.remove(salmonbam) - - print('Finished postmaster for {0}!'.format(samplename)) - - -def addMD(samplename, reffasta, 
nthreads): - inputbam = os.path.join(os.getcwd(), 'postmaster', '{0}.postmaster.bam'.format(samplename)) - command = ['samtools', 'calmd', '-b', '--threads', nthreads, inputbam, reffasta] - - print('Adding MD tags to {0}.postmaster.md.bam...'.format(samplename)) - with open(samplename + '.postmaster.md.bam', 'w') as outfile: - subprocess.run(command, stdout = outfile) - print('Finished adding MD tags to {0}.postmaster.md.bam!'.format(samplename)) - -if __name__ == '__main__': - parser = argparse.ArgumentParser(description = 'Align and quantify reads using STAR, salmon, and postmaster in preparation for analysis with PIGPEN.') - parser.add_argument('--forwardreads', type = str, help = 'Forward reads. Gzipped fastq.') - parser.add_argument('--reversereads', type = str, help = 'Reverse reads. Gzipped fastq.') - parser.add_argument('--nthreads', type = str, help = 'Number of threads to use for alignment and quantification.') - parser.add_argument('--STARindex', type = str, help = 'STAR index directory.') - parser.add_argument('--salmonindex', type = str, help = 'Salmon index directory.') - parser.add_argument('--samplename', type = str, help = 'Sample name. Will be appended to output files.') - parser.add_argument('--dedupUMI', action = 'store_true', help = 'Deduplicate UMIs? 
requires UMI extract.') - args = parser.parse_args() - - r1 = os.path.abspath(args.forwardreads) - r2 = os.path.abspath(args.reversereads) - STARindex = os.path.abspath(args.STARindex) - salmonindex = os.path.abspath(args.salmonindex) - samplename = args.samplename - nthreads = args.nthreads - - wd = os.path.abspath(os.getcwd()) - sampledir = os.path.join(wd, samplename) - if os.path.exists(sampledir) and os.path.isdir(sampledir): - shutil.rmtree(sampledir) - os.mkdir(sampledir) - os.chdir(sampledir) - - #uniquely aligning read files - if args.dedupUMI: - salmonR1 = samplename + '.dedup.r1.fq.gz' - salmonR2 = samplename + '.dedup.r2.fq.gz' - else: - salmonR1 = samplename + '.STARaligned.r1.fq.gz' - salmonR2 = samplename + '.STARaligned.r2.fq.gz' - - runSTAR(r1, r2, nthreads, STARindex, samplename) - if args.dedupUMI: - runDedup(samplename, nthreads) - bamtofastq(samplename, nthreads, args.dedupUMI) - runSalmon(salmonR1, salmonR2, nthreads, salmonindex, samplename) - os.chdir(sampledir) - runPostmaster(samplename, nthreads) - - - - diff --git a/pigpen.py b/pigpen.py index 34fe04a..2b1d69b 100644 --- a/pigpen.py +++ b/pigpen.py @@ -33,7 +33,6 @@ parser.add_argument('--minMappingQual', type = int, help = 'Minimum mapping quality for a read to be considered in conversion counting. STAR unique mappers have MAPQ 255.', required = True) parser.add_argument('--nConv', type = int, help = 'Minimum number of required G->T and/or G->C conversions in a read pair in order for conversions to be counted. Default is 1.', default = 1) parser.add_argument('--outputDir', type = str, help = 'Output directory.', required = True) - parser.add_argument('--dedupUMI', action = 'store_true', help = 'Deduplicate UMIs? 
requires UMI extract.') args = parser.parse_args() #Store command line arguments @@ -47,7 +46,6 @@ samplenames = args.samplenames.split(',') salmonquants = [os.path.join(x, 'salmon', '{0}.quant.sf'.format(x)) for x in samplenames] starbams = [os.path.join(x, 'STAR', '{0}Aligned.sortedByCoord.out.bam'.format(x)) for x in samplenames] - dedupbams = [os.path.join(x, 'STAR', '{0}.dedup.bam'.format(x)) for x in samplenames] postmasterbams = [os.path.join(x, 'postmaster', '{0}.postmaster.bam'.format(x)) for x in samplenames] #Take in list of control samples, make list of their corresponding star bams for SNP calling @@ -58,13 +56,9 @@ if x in controlsamples: controlindicies.append(ind) - controlsamplebams = [] - if args.dedupUMI: - for x in controlindicies: - controlsamplebams.append(dedupbams[x]) - else: - for x in controlindicies: - controlsamplebams.append(starbams[x]) + controlstarbams = [] + for x in controlindicies: + controlstarbams.append(starbams[x]) #We have to be either looking for G->T or G->C, if not both if not args.use_g_t and not args.use_g_c: @@ -90,7 +84,7 @@ if args.useSNPs and not args.snpfile: if not os.path.exists('snps'): os.mkdir('snps') - vcfFileNames = getSNPs(controlsamplebams, args.genomeFasta, args.SNPcoverage, args.SNPfreq) + vcfFileNames = getSNPs(controlstarbams, args.genomeFasta, args.SNPcoverage, args.SNPfreq) for f in vcfFileNames: csi = f + '.csi' log = f[:-3] + '.log' @@ -123,20 +117,12 @@ sampleparams['sample'] = sample print('Running PIGPEN for {0}...'.format(sample)) - if args.dedupUMI: - dedupbam = dedupbams[ind] - sampleparams['dedupbam'] = os.path.abspath(dedupbam) - if args.nproc == 1: - convs, readcounter = iteratereads_pairedend(dedupbam, args.onlyConsiderOverlap, args.use_g_t, args.use_g_c, args.use_read1, args.use_read2, args.nConv, args.minMappingQual, snps, maskpositions, 'high') - elif args.nproc > 1: - convs = getmismatches(dedupbam, args.onlyConsiderOverlap, snps, maskpositions, args.nConv, args.minMappingQual, 
args.nproc, args.use_g_t, args.use_g_c, args.use_read1, args.use_read2) - else: - starbam = starbams[ind] - sampleparams['starbam'] = os.path.abspath(starbam) - if args.nproc == 1: - convs, readcounter = iteratereads_pairedend(starbam, args.onlyConsiderOverlap, args.use_g_t, args.use_g_c, args.use_read1, args.use_read2, args.nConv, args.minMappingQual, snps, maskpositions, 'high') - elif args.nproc > 1: - convs = getmismatches(starbam, args.onlyConsiderOverlap, snps, maskpositions, args.nConv, args.minMappingQual, args.nproc, args.use_g_t, args.use_g_c, args.use_read1, args.use_read2) + starbam = starbams[ind] + sampleparams['starbam'] = os.path.abspath(starbam) + if args.nproc == 1: + convs, readcounter = iteratereads_pairedend(starbam, args.onlyConsiderOverlap, args.use_g_t, args.use_g_c, args.use_read1, args.use_read2, args.nConv, args.minMappingQual, snps, maskpositions, 'high') + elif args.nproc > 1: + convs = getmismatches(starbam, args.onlyConsiderOverlap, snps, maskpositions, args.nConv, args.minMappingQual, args.nproc, args.use_g_t, args.use_g_c, args.use_read1, args.use_read2) print('Getting posterior probabilities from salmon alignment file...') postmasterbam = postmasterbams[ind] @@ -175,24 +161,14 @@ sampleparams['sample'] = sample print('Running PIGPEN for {0}...'.format(sample)) - if args.dedupUMI: - dedupbam = dedupbams[ind] - sampleparams['dedupbam'] = os.path.abspath(starbam) - if args.nproc == 1: - convs, readcounter = iteratereads_pairedend( - dedupbam, args.onlyConsiderOverlap, args.use_g_t, args.use_g_c, args.use_read1, args.use_read2, args.nConv, args.minMappingQual, snps, maskpositions, 'high') - elif args.nproc > 1: - convs = getmismatches(dedupbam, args.onlyConsiderOverlap, snps, maskpositions, - args.nConv, args.minMappingQual, args.nproc, args.use_g_t, args.use_g_c, args.use_read1, args.use_read2) - else: - starbam = starbams[ind] - sampleparams['starbam'] = os.path.abspath(starbam) - if args.nproc == 1: - convs, readcounter = 
iteratereads_pairedend( - starbam, args.onlyConsiderOverlap, args.use_g_t, args.use_g_c, args.use_read1, args.use_read2, args.nConv, args.minMappingQual, snps, maskpositions, 'high') - elif args.nproc > 1: - convs = getmismatches(starbam, args.onlyConsiderOverlap, snps, maskpositions, - args.nConv, args.minMappingQual, args.nproc, args.use_g_t, args.use_g_c, args.use_read1, args.use_read2) + starbam = starbams[ind] + sampleparams['starbam'] = os.path.abspath(starbam) + if args.nproc == 1: + convs, readcounter = iteratereads_pairedend( + starbam, args.onlyConsiderOverlap, args.use_g_t, args.use_g_c, args.use_read1, args.use_read2, args.nConv, args.minMappingQual, snps, maskpositions, 'high') + elif args.nproc > 1: + convs = getmismatches(starbam, args.onlyConsiderOverlap, snps, maskpositions, + args.nConv, args.minMappingQual, args.nproc, args.use_g_t, args.use_g_c, args.use_read1, args.use_read2) print('Assigning reads to genes in supplied bed file...') overlaps, numpairs = getReadOverlaps(starbam, args.ROIbed, 'chrsort.txt') From 783a68e4158621c16d27995a7e3aa7d0657fa772 Mon Sep 17 00:00:00 2001 From: goeringr Date: Mon, 6 Feb 2023 12:26:17 -0700 Subject: [PATCH 047/108] dedupUMI Including option to deduplicate UMIs as a flag in both AlignUMIquant and Pigpen. This requires reads to have UMIs appended to the read header by UMI_tools extract --- alignUMIquant.py | 224 +++++++++++++++++++++++++++++++++++++++++++++++ pigpen.py | 66 +++++++++----- 2 files changed, 269 insertions(+), 21 deletions(-) create mode 100644 alignUMIquant.py diff --git a/alignUMIquant.py b/alignUMIquant.py new file mode 100644 index 0000000..b881e7e --- /dev/null +++ b/alignUMIquant.py @@ -0,0 +1,224 @@ +import os +import subprocess +import sys +import shutil +import argparse +''' +Given a pair of read files, align reads using STAR, deduplicate reads by UMI, and quantify reads using salmon. 
+This will make a STAR-produced bam (for pigpen mutation calling). +This bam will be deduplicated with UMI-tools, then passed to salmon (for read assignment). +It will then run postmaster to append transcript assignments to the salmon-produced bam. + +This is going to take in gzipped fastqs with UMIs extracted, +a directory containing the STAR index for this genome, and a directory containing the salmon index for this genome. + +Reads are aligned to the genome using STAR. This bam file will be used for mutation calling. +In this alignment, we allow multiple mapping reads, but only report the best alignment. +This bam will then be deduplicated based on UMI and alignment position. + +Reads are then quantified using salmon, where a separate transcriptome-oriented bam is written. +Postmaster then takes this bam and adds posterior probabilities for transcript assignments. + +When runSTAR(), runSalmon(), and runPostmaster() are run in succession, the output is a directory called <samplename>. +In this directory, the STAR output is <samplename>Aligned.sortedByCoord.out.bam in STAR/, +the salmon output is <samplename>.quant.sf and <samplename>.salmon.bam in salmon/, +and the postmaster output is <samplename>.postmaster.bam in postmaster/. + +Requires STAR, salmon (>= 1.9.0), and postmaster to be in the user's PATH.
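The per-sample layout described above is the same one pigpen.py later reconstructs when building its input path lists. As a sketch (this helper is illustrative and not part of the patch; the path patterns are taken from the lists built at the top of pigpen.py), the naming conventions can be written down as:

```python
import os

def expected_outputs(samplename):
    # Mirrors the per-sample path lists built in pigpen.py
    # (STAR bam, UMI-deduplicated bam, salmon quant/bam, postmaster bam).
    return {
        'starbam': os.path.join(samplename, 'STAR', '{0}Aligned.sortedByCoord.out.bam'.format(samplename)),
        'dedupbam': os.path.join(samplename, 'STAR', '{0}.dedup.bam'.format(samplename)),
        'quant': os.path.join(samplename, 'salmon', '{0}.quant.sf'.format(samplename)),
        'salmonbam': os.path.join(samplename, 'salmon', '{0}.salmon.bam'.format(samplename)),
        'postmasterbam': os.path.join(samplename, 'postmaster', '{0}.postmaster.bam'.format(samplename)),
    }
```

For example, expected_outputs('pDBF10M')['quant'] resolves to pDBF10M/salmon/pDBF10M.quant.sf on POSIX systems.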
+''' + +def runSTAR(reads1, reads2, nthreads, STARindex, samplename): + if not os.path.exists('STAR'): + os.mkdir('STAR') + + cwd = os.getcwd() + outdir = os.path.join(cwd, 'STAR') + + #Clean output directory if it already exists + if os.path.exists(outdir) and os.path.isdir(outdir): + shutil.rmtree(outdir) + + os.mkdir(outdir) + prefix = outdir + '/' + samplename + + command = ['STAR', '--runMode', 'alignReads', '--runThreadN', nthreads, '--genomeLoad', 'NoSharedMemory', '--genomeDir', STARindex, '--readFilesIn', reads1, reads2, '--readFilesCommand', + 'zcat', '--outFileNamePrefix', prefix, '--outSAMtype', 'BAM', 'SortedByCoordinate', '--outSAMstrandField', 'intronMotif', '--outSAMmultNmax', '1', '--outSAMattributes', 'MD', 'NH'] + + print('Running STAR for {0}...'.format(samplename)) + + subprocess.run(command) + + #make index + bam = os.path.join(outdir, samplename + 'Aligned.sortedByCoord.out.bam') + bamindex = bam + '.bai' + if not os.path.exists(bamindex): + indexCMD = 'samtools index ' + bam + index = subprocess.Popen(indexCMD, shell=True) + index.wait() + + print('Finished STAR for {0}!'.format(samplename)) + + +def runDedup(samplename, nthreads): + STARbam = os.path.join(os.getcwd(), 'STAR', '{0}Aligned.sortedByCoord.out.bam'.format(samplename)) + dedupbam = os.path.join(os.getcwd(), 'STAR', '{0}.dedup.bam'.format(samplename)) + command = ['umi_tools', 'dedup', '-I', STARbam, '--paired', ' --output-stats=deduplicated', '-S', dedupbam] + + print('Running deduplication for {0}...'.format(samplename)) + + subprocess.run(command) + + command = ['samtools', 'index', dedupbam] + subprocess.run(command) + + #We don't need the STAR alignment file anymore, and it's pretty big + # os.remove(STARbam) + + print('Finished deduplicating {0}!'.format(samplename)) + + +def bamtofastq(samplename, nthreads, dedup): + #Given a bam file of uniquely aligned reads (produced from runSTAR), rederive these reads as fastq in preparation for submission to salmon + if not 
os.path.exists('STAR'): + os.mkdir('STAR') + + cwd = os.getcwd() + outdir = os.path.join(cwd, 'STAR') + if dedup: + inbam = os.path.join(outdir, samplename + '.dedup.bam') + else: + inbam = os.path.join(outdir, samplename + 'Aligned.sortedByCoord.out.bam') + sortedbam = os.path.join(outdir, 'temp.namesort.bam') + + #First sort bam file by readname + print('Sorting bam file by read name...') + command = ['samtools', 'collate', '--threads', nthreads, '-u', '-o', sortedbam, inbam] + subprocess.call(command) + print('Done!') + + #Now derive fastq + if dedup: + r1file = samplename + '.dedup.r1.fq.gz' + r2file = samplename + '.dedup.r2.fq.gz' + else: + r1file = samplename + '.STARaligned.r1.fq.gz' + r2file = samplename + '.STARaligned.r2.fq.gz' + print('Writing fastq file of deduplicated reads for {0}...'.format(samplename)) + command = ['samtools', 'fastq', '--threads', nthreads, '-1', r1file, '-2', r2file, '-0', '/dev/null', '-s', '/dev/null', '-n', sortedbam] + subprocess.call(command) + print('Done writing fastq files for {0}!'.format(samplename)) + os.remove(sortedbam) + + +def runSalmon(reads1, reads2, nthreads, salmonindex, samplename): + #Take in those deduplicated reads and quantify transcript abundance with them using salmon. 
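The bam-to-fastq step above reduces to two samtools calls: collate reads by name, then emit paired fastqs while routing unpaired and singleton reads to /dev/null. A sketch of the command construction (the default thread count and temp-file name here are placeholders, not values from the patch):

```python
def bamtofastq_commands(inbam, r1file, r2file, nthreads='4', sortedbam='temp.namesort.bam'):
    # samtools collate groups mates adjacently by read name;
    # -u writes the temporary bam uncompressed.
    collate = ['samtools', 'collate', '--threads', nthreads, '-u', '-o', sortedbam, inbam]
    # samtools fastq writes R1/R2; -0 and -s discard "other" and singleton
    # reads, and -n leaves read names without /1 and /2 suffixes.
    fastq = ['samtools', 'fastq', '--threads', nthreads, '-1', r1file, '-2', r2file,
             '-0', '/dev/null', '-s', '/dev/null', '-n', sortedbam]
    return collate, fastq
```

Each list can be handed directly to subprocess.run(), as the surrounding functions do.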
+ + if not os.path.exists('salmon'): + os.mkdir('salmon') + + idx = os.path.abspath(salmonindex) + r1 = os.path.abspath(reads1) + r2 = os.path.abspath(reads2) + + os.chdir('salmon') + + command = ['salmon', 'quant', '--libType', 'A', '-p', nthreads, '--seqBias', '--gcBias', + '--validateMappings', '-1', r1, '-2', r2, '-o', samplename, '--index', idx, '--writeMappings={0}.salmon.bam'.format(samplename), '--writeQualities'] + + print('Running salmon for {0}...'.format(samplename)) + + subprocess.call(command) + + #Move output + outputdir = os.path.join(os.getcwd(), samplename) + quantfile = os.path.join(outputdir, 'quant.sf') + movedquantfile = os.path.join(os.getcwd(), '{0}.quant.sf'.format(samplename)) + os.rename(quantfile, movedquantfile) + + #Remove uniquely aligning read files + os.remove(r1) + os.remove(r2) + + print('Finished salmon for {0}!'.format(samplename)) + + +def runPostmaster(samplename, nthreads): + if not os.path.exists('postmaster'): + os.mkdir('postmaster') + + salmonquant = os.path.join(os.getcwd(), 'salmon', '{0}.quant.sf'.format(samplename)) + salmonbam = os.path.join(os.getcwd(), 'salmon', '{0}.salmon.bam'.format(samplename)) + + os.chdir('postmaster') + outputfile = os.path.join(os.getcwd(), '{0}.postmaster.bam'.format(samplename)) + + print('Running postmaster for {0}...'.format(samplename)) + command = ['postmaster', '--num-threads', nthreads, '--quant', salmonquant, '--alignments', salmonbam, '--output', outputfile] + subprocess.call(command) + + #Sort and index bam + with open(outputfile + '.sort', 'w') as sortedfh: + command = ['samtools', 'sort', '-@', nthreads, outputfile] + subprocess.run(command, stdout = sortedfh) + os.rename(outputfile + '.sort', outputfile) + + command = ['samtools', 'index', outputfile] + subprocess.run(command) + + #We don't need the salmon alignment file anymore, and it's pretty big + os.remove(salmonbam) + + print('Finished postmaster for {0}!'.format(samplename)) + + +def addMD(samplename, reffasta, 
nthreads): + inputbam = os.path.join(os.getcwd(), 'postmaster', '{0}.postmaster.bam'.format(samplename)) + command = ['samtools', 'calmd', '-b', '--threads', nthreads, inputbam, reffasta] + + print('Adding MD tags to {0}.postmaster.md.bam...'.format(samplename)) + with open(samplename + '.postmaster.md.bam', 'w') as outfile: + subprocess.run(command, stdout = outfile) + print('Finished adding MD tags to {0}.postmaster.md.bam!'.format(samplename)) + +if __name__ == '__main__': + parser = argparse.ArgumentParser(description = 'Align and quantify reads using STAR, salmon, and postmaster in preparation for analysis with PIGPEN.') + parser.add_argument('--forwardreads', type = str, help = 'Forward reads. Gzipped fastq.') + parser.add_argument('--reversereads', type = str, help = 'Reverse reads. Gzipped fastq.') + parser.add_argument('--nthreads', type = str, help = 'Number of threads to use for alignment and quantification.') + parser.add_argument('--STARindex', type = str, help = 'STAR index directory.') + parser.add_argument('--salmonindex', type = str, help = 'Salmon index directory.') + parser.add_argument('--samplename', type = str, help = 'Sample name. Will be appended to output files.') + parser.add_argument('--dedupUMI', action = 'store_true', help = 'Deduplicate UMIs? 
requires UMI extract.') + args = parser.parse_args() + + r1 = os.path.abspath(args.forwardreads) + r2 = os.path.abspath(args.reversereads) + STARindex = os.path.abspath(args.STARindex) + salmonindex = os.path.abspath(args.salmonindex) + samplename = args.samplename + nthreads = args.nthreads + + wd = os.path.abspath(os.getcwd()) + sampledir = os.path.join(wd, samplename) + if os.path.exists(sampledir) and os.path.isdir(sampledir): + shutil.rmtree(sampledir) + os.mkdir(sampledir) + os.chdir(sampledir) + + #uniquely aligning read files + if args.dedupUMI: + salmonR1 = samplename + '.dedup.r1.fq.gz' + salmonR2 = samplename + '.dedup.r2.fq.gz' + else: + salmonR1 = samplename + '.STARaligned.r1.fq.gz' + salmonR2 = samplename + '.STARaligned.r2.fq.gz' + + runSTAR(r1, r2, nthreads, STARindex, samplename) + if args.dedupUMI: + runDedup(samplename, nthreads) + bamtofastq(samplename, nthreads, args.dedupUMI) + runSalmon(salmonR1, salmonR2, nthreads, salmonindex, samplename) + os.chdir(sampledir) + runPostmaster(samplename, nthreads) + + diff --git a/pigpen.py b/pigpen.py index 2b1d69b..be31b32 100644 --- a/pigpen.py +++ b/pigpen.py @@ -33,6 +33,7 @@ parser.add_argument('--minMappingQual', type = int, help = 'Minimum mapping quality for a read to be considered in conversion counting. STAR unique mappers have MAPQ 255.', required = True) parser.add_argument('--nConv', type = int, help = 'Minimum number of required G->T and/or G->C conversions in a read pair in order for conversions to be counted. Default is 1.', default = 1) parser.add_argument('--outputDir', type = str, help = 'Output directory.', required = True) + parser.add_argument('--dedupUMI', action = 'store_true', help = 'Deduplicate UMIs? 
requires UMI extract.') args = parser.parse_args() #Store command line arguments @@ -46,6 +47,7 @@ samplenames = args.samplenames.split(',') salmonquants = [os.path.join(x, 'salmon', '{0}.quant.sf'.format(x)) for x in samplenames] starbams = [os.path.join(x, 'STAR', '{0}Aligned.sortedByCoord.out.bam'.format(x)) for x in samplenames] + dedupbams = [os.path.join(x, 'STAR', '{0}.dedup.bam'.format(x)) for x in samplenames] postmasterbams = [os.path.join(x, 'postmaster', '{0}.postmaster.bam'.format(x)) for x in samplenames] #Take in list of control samples, make list of their corresponding star bams for SNP calling @@ -56,9 +58,13 @@ if x in controlsamples: controlindicies.append(ind) - controlstarbams = [] - for x in controlindicies: - controlstarbams.append(starbams[x]) + controlsamplebams = [] + if args.dedupUMI: + for x in controlindicies: + controlsamplebams.append(dedupbams[x]) + else: + for x in controlindicies: + controlsamplebams.append(starbams[x]) #We have to be either looking for G->T or G->C, if not both if not args.use_g_t and not args.use_g_c: @@ -74,7 +80,7 @@ if args.onlyConsiderOverlap and (not args.use_read1 or not args.use_read2): print('If we are only going to consider overlap between paired reads, we must use both read1 and read2.') sys.exit() - + #Make vcf file for snps if args.snpfile: snps = recordSNPs(args.snpfile) @@ -84,7 +90,7 @@ if args.useSNPs and not args.snpfile: if not os.path.exists('snps'): os.mkdir('snps') - vcfFileNames = getSNPs(controlstarbams, args.genomeFasta, args.SNPcoverage, args.SNPfreq) + vcfFileNames = getSNPs(controlsamplebams, args.genomeFasta, args.SNPcoverage, args.SNPfreq) for f in vcfFileNames: csi = f + '.csi' log = f[:-3] + '.log' @@ -96,7 +102,7 @@ os.rename('merged.vcf', os.path.join('snps', 'merged.vcf')) os.rename('vcfconcat.log', os.path.join('snps', 'vcfconcat.log')) snps = recordSNPs(os.path.join('snps', 'merged.vcf')) - + elif not args.useSNPs and not args.snpfile: snps = None @@ -117,12 +123,20 @@ 
sampleparams['sample'] = sample print('Running PIGPEN for {0}...'.format(sample)) - starbam = starbams[ind] - sampleparams['starbam'] = os.path.abspath(starbam) - if args.nproc == 1: - convs, readcounter = iteratereads_pairedend(starbam, args.onlyConsiderOverlap, args.use_g_t, args.use_g_c, args.use_read1, args.use_read2, args.nConv, args.minMappingQual, snps, maskpositions, 'high') - elif args.nproc > 1: - convs = getmismatches(starbam, args.onlyConsiderOverlap, snps, maskpositions, args.nConv, args.minMappingQual, args.nproc, args.use_g_t, args.use_g_c, args.use_read1, args.use_read2) + if args.dedupUMI: + dedupbam = dedupbams[ind] + sampleparams['dedupbam'] = os.path.abspath(dedupbam) + if args.nproc == 1: + convs, readcounter = iteratereads_pairedend(dedupbam, args.onlyConsiderOverlap, args.use_g_t, args.use_g_c, args.use_read1, args.use_read2, args.nConv, args.minMappingQual, snps, maskpositions, 'high') + elif args.nproc > 1: + convs = getmismatches(dedupbam, args.onlyConsiderOverlap, snps, maskpositions, args.nConv, args.minMappingQual, args.nproc, args.use_g_t, args.use_g_c, args.use_read1, args.use_read2) + else: + starbam = starbams[ind] + sampleparams['starbam'] = os.path.abspath(starbam) + if args.nproc == 1: + convs, readcounter = iteratereads_pairedend(starbam, args.onlyConsiderOverlap, args.use_g_t, args.use_g_c, args.use_read1, args.use_read2, args.nConv, args.minMappingQual, snps, maskpositions, 'high') + elif args.nproc > 1: + convs = getmismatches(starbam, args.onlyConsiderOverlap, snps, maskpositions, args.nConv, args.minMappingQual, args.nproc, args.use_g_t, args.use_g_c, args.use_read1, args.use_read2) print('Getting posterior probabilities from salmon alignment file...') postmasterbam = postmasterbams[ind] @@ -161,14 +175,24 @@ sampleparams['sample'] = sample print('Running PIGPEN for {0}...'.format(sample)) - starbam = starbams[ind] - sampleparams['starbam'] = os.path.abspath(starbam) - if args.nproc == 1: - convs, readcounter = 
iteratereads_pairedend( - starbam, args.onlyConsiderOverlap, args.use_g_t, args.use_g_c, args.use_read1, args.use_read2, args.nConv, args.minMappingQual, snps, maskpositions, 'high') - elif args.nproc > 1: - convs = getmismatches(starbam, args.onlyConsiderOverlap, snps, maskpositions, - args.nConv, args.minMappingQual, args.nproc, args.use_g_t, args.use_g_c, args.use_read1, args.use_read2) + if args.dedupUMI: + dedupbam = dedupbams[ind] + sampleparams['dedupbam'] = os.path.abspath(starbam) + if args.nproc == 1: + convs, readcounter = iteratereads_pairedend( + dedupbam, args.onlyConsiderOverlap, args.use_g_t, args.use_g_c, args.use_read1, args.use_read2, args.nConv, args.minMappingQual, snps, maskpositions, 'high') + elif args.nproc > 1: + convs = getmismatches(dedupbam, args.onlyConsiderOverlap, snps, maskpositions, + args.nConv, args.minMappingQual, args.nproc, args.use_g_t, args.use_g_c, args.use_read1, args.use_read2) + else: + starbam = starbams[ind] + sampleparams['starbam'] = os.path.abspath(starbam) + if args.nproc == 1: + convs, readcounter = iteratereads_pairedend( + starbam, args.onlyConsiderOverlap, args.use_g_t, args.use_g_c, args.use_read1, args.use_read2, args.nConv, args.minMappingQual, snps, maskpositions, 'high') + elif args.nproc > 1: + convs = getmismatches(starbam, args.onlyConsiderOverlap, snps, maskpositions, + args.nConv, args.minMappingQual, args.nproc, args.use_g_t, args.use_g_c, args.use_read1, args.use_read2) print('Assigning reads to genes in supplied bed file...') overlaps, numpairs = getReadOverlaps(starbam, args.ROIbed, 'chrsort.txt') @@ -177,4 +201,4 @@ if not os.path.exists(args.outputDir): os.mkdir(args.outputDir) outputfile = os.path.join(args.outputDir, sample + '.pigpen.txt') - writeConvsPerGene(sampleparams, numreadspergene, convsPerGene, outputfile, args.use_g_t, args.use_g_c) \ No newline at end of file + writeConvsPerGene(sampleparams, numreadspergene, convsPerGene, outputfile, args.use_g_t, args.use_g_c) From 
e7264fed939397238743e229c23eb3cbd09a5041 Mon Sep 17 00:00:00 2001 From: goeringr Date: Tue, 7 Feb 2023 07:28:08 -0700 Subject: [PATCH 048/108] more UMI_tools updates A simple script for extracting UMIs and simplification of the --dedupUMI flag in pigpen.py --- ExtractUMI.py | 62 +++++++++++++++++++++++++++++++++++++++++++++ pigpen.py | 70 ++++++++++++++++++++------------------------------- 2 files changed, 89 insertions(+), 43 deletions(-) create mode 100644 ExtractUMI.py diff --git a/ExtractUMI.py b/ExtractUMI.py new file mode 100644 index 0000000..5ed59a3 --- /dev/null +++ b/ExtractUMI.py @@ -0,0 +1,62 @@ +import os +import subprocess +import sys +import shutil +import argparse +''' +Given a pair of read files, extract UMIs matching a given pattern. +This step is required for any downstream UMI deduplication. +Requires UMI_tools. +Usage: +python -u ~/Projects/OINC_seq/test/py2test/ExtractUMI.py --forwardreads ./ATP5MC1_Rep1_pDBF.10M.R1.fq.gz,./ATP5MC1_Rep1_mDBF.10M.R1.fq.gz --reversereads ./ATP5MC1_Rep1_pDBF.10M.R2.fq.gz,./ATP5MC1_Rep1_mDBF.10M.R2.fq.gz --samplenames pDBF10M,mDBF10M --lib_type LEXO + +''' +def runExtract(r1, r2, samplename, lib_type): + if not os.path.exists('UMI_fastq'): + os.mkdir('UMI_fastq') + + cwd = os.getcwd() + outdir = os.path.join(cwd, 'UMI_fastq') + + r1 = r1.split(",") + r2 = r2.split(",") + samplename = samplename.split(",") + + for idx, sample in enumerate(samplename): + + reads1 = r1[idx] + reads2 = r2[idx] + output1 = outdir + '/' + sample + '.R1.fq.gz' + output2 = outdir + '/' + sample + '.R2.fq.gz' + + if lib_type == "LEXO": + command = ["umi_tools", "extract", "-I", reads1, '--bc-pattern=NNNNNN', '--read2-in={0}'.format(reads2), '--stdout={0}'.format(output1),'--read2-out={0}'.format(output2)] + elif lib_type == "SA": + command = ["umi_tools", "extract", "-I", reads2, '--bc-pattern=NNNNNNNNNNNN', '--read2-in={0}'.format(reads1), '--stdout={0}'.format(output2),'--read2-out={0}'.format(output1)] + else: + print('--lib_type must be either
"LEXO" or "SA"') + sys.exit() + + print('Extracting UMIs for {0}...'.format(sample)) + subprocess.run(command) + print('Finished Extracting UMIs for {0}!'.format(sample)) + +if __name__ == '__main__': + parser = argparse.ArgumentParser(description = 'Extract UMIs using umi-tools in preparation for analysis with AlignUMIquant.') + parser.add_argument('--forwardreads', type = str, help = 'Forward reads. Gzipped fastq.', required = True) + parser.add_argument('--reversereads', type = str, help = 'Reverse reads. Gzipped fastq.', required = True) + parser.add_argument('--samplenames', type = str, help = 'Sample name. Will be appended to output files.', required = True) + parser.add_argument('--lib_type', type = str, help = 'Library type. Either "LEXO" or "SA"', required = True) + args = parser.parse_args() + + r1 = os.path.abspath(args.forwardreads) + r2 = os.path.abspath(args.reversereads) + samplename = args.samplenames + lib_type = args.lib_type + + if args.lib_type not in ["LEXO", "SA"]: + print('--lib_type must be either "LEXO" or "SA"') + sys.exit() + + runExtract(r1, r2, samplename, lib_type) + diff --git a/pigpen.py b/pigpen.py index be31b32..f4b381a 100644 --- a/pigpen.py +++ b/pigpen.py @@ -54,17 +54,17 @@ if args.controlsamples: controlsamples = args.controlsamples.split(',') controlindicies = [] - for ind, x in enumerate(samplenames): - if x in controlsamples: - controlindicies.append(ind) - - controlsamplebams = [] - if args.dedupUMI: - for x in controlindicies: - controlsamplebams.append(dedupbams[x]) + samplebams = [] + controlsamplebams = [] + for ind, x in enumerate(samplenames): + if args.dedupUMI: + samplebams.append(dedupbams[ind]) + if args.controlsamples and x in controlsamples: + controlsamplebams.append(dedupbams[ind]) else: - for x in controlindicies: - controlsamplebams.append(starbams[x]) + samplebams.append(starbams[ind]) + if args.controlsamples and x in controlsamples: + controlsamplebams.append(starbams[ind]) #We have to be either looking 
for G->T or G->C, if not both if not args.use_g_t and not args.use_g_c: @@ -123,20 +123,13 @@ sampleparams['sample'] = sample print('Running PIGPEN for {0}...'.format(sample)) - if args.dedupUMI: - dedupbam = dedupbams[ind] - sampleparams['dedupbam'] = os.path.abspath(dedupbam) - if args.nproc == 1: - convs, readcounter = iteratereads_pairedend(dedupbam, args.onlyConsiderOverlap, args.use_g_t, args.use_g_c, args.use_read1, args.use_read2, args.nConv, args.minMappingQual, snps, maskpositions, 'high') - elif args.nproc > 1: - convs = getmismatches(dedupbam, args.onlyConsiderOverlap, snps, maskpositions, args.nConv, args.minMappingQual, args.nproc, args.use_g_t, args.use_g_c, args.use_read1, args.use_read2) - else: - starbam = starbams[ind] - sampleparams['starbam'] = os.path.abspath(starbam) - if args.nproc == 1: - convs, readcounter = iteratereads_pairedend(starbam, args.onlyConsiderOverlap, args.use_g_t, args.use_g_c, args.use_read1, args.use_read2, args.nConv, args.minMappingQual, snps, maskpositions, 'high') - elif args.nproc > 1: - convs = getmismatches(starbam, args.onlyConsiderOverlap, snps, maskpositions, args.nConv, args.minMappingQual, args.nproc, args.use_g_t, args.use_g_c, args.use_read1, args.use_read2) + + samplebam = samplebams[ind] + sampleparams['samplebam'] = os.path.abspath(samplebam) + if args.nproc == 1: + convs, readcounter = iteratereads_pairedend(samplebam, args.onlyConsiderOverlap, args.use_g_t, args.use_g_c, args.use_read1, args.use_read2, args.nConv, args.minMappingQual, snps, maskpositions, 'high') + elif args.nproc > 1: + convs = getmismatches(samplebam, args.onlyConsiderOverlap, snps, maskpositions, args.nConv, args.minMappingQual, args.nproc, args.use_g_t, args.use_g_c, args.use_read1, args.use_read2) print('Getting posterior probabilities from salmon alignment file...') postmasterbam = postmasterbams[ind] @@ -175,27 +168,18 @@ sampleparams['sample'] = sample print('Running PIGPEN for {0}...'.format(sample)) - if args.dedupUMI: - 
dedupbam = dedupbams[ind] - sampleparams['dedupbam'] = os.path.abspath(starbam) - if args.nproc == 1: - convs, readcounter = iteratereads_pairedend( - dedupbam, args.onlyConsiderOverlap, args.use_g_t, args.use_g_c, args.use_read1, args.use_read2, args.nConv, args.minMappingQual, snps, maskpositions, 'high') - elif args.nproc > 1: - convs = getmismatches(dedupbam, args.onlyConsiderOverlap, snps, maskpositions, - args.nConv, args.minMappingQual, args.nproc, args.use_g_t, args.use_g_c, args.use_read1, args.use_read2) - else: - starbam = starbams[ind] - sampleparams['starbam'] = os.path.abspath(starbam) - if args.nproc == 1: - convs, readcounter = iteratereads_pairedend( - starbam, args.onlyConsiderOverlap, args.use_g_t, args.use_g_c, args.use_read1, args.use_read2, args.nConv, args.minMappingQual, snps, maskpositions, 'high') - elif args.nproc > 1: - convs = getmismatches(starbam, args.onlyConsiderOverlap, snps, maskpositions, - args.nConv, args.minMappingQual, args.nproc, args.use_g_t, args.use_g_c, args.use_read1, args.use_read2) + + samplebam = samplebam[ind] + sampleparams['samplebam'] = os.path.abspath(samplebam) + if args.nproc == 1: + convs, readcounter = iteratereads_pairedend( + samplebam, args.onlyConsiderOverlap, args.use_g_t, args.use_g_c, args.use_read1, args.use_read2, args.nConv, args.minMappingQual, snps, maskpositions, 'high') + elif args.nproc > 1: + convs = getmismatches(samplebam, args.onlyConsiderOverlap, snps, maskpositions, + args.nConv, args.minMappingQual, args.nproc, args.use_g_t, args.use_g_c, args.use_read1, args.use_read2) print('Assigning reads to genes in supplied bed file...') - overlaps, numpairs = getReadOverlaps(starbam, args.ROIbed, 'chrsort.txt') + overlaps, numpairs = getReadOverlaps(samplebam, args.ROIbed, 'chrsort.txt') read2gene = processOverlaps(overlaps, numpairs) numreadspergene, convsPerGene = getPerGene(convs, read2gene) if not os.path.exists(args.outputDir): From b74aa97be4c4c539612ddad21c722cfb1d0fafc1 Mon Sep 17 
00:00:00 2001 From: goeringr Date: Mon, 13 Feb 2023 08:09:12 -0700 Subject: [PATCH 049/108] added --libType parameter This will make alignUMIquant.py more single-amplicon friendly. Unnecessary processes (salmon, postmaster) will not be run for single-amplicon libraries. --- .gitignore | 10 ++++++++++ alignUMIquant.py | 36 +++++++++++++++++++++--------------- pigpen.py | 6 +++--- 3 files changed, 34 insertions(+), 18 deletions(-) create mode 100644 .gitignore diff --git a/.gitignore b/.gitignore new file mode 100644 index 0000000..b6bf794 --- /dev/null +++ b/.gitignore @@ -0,0 +1,10 @@ + +.DS_Store +.spyproject/config/codestyle.ini +.spyproject/config/defaults/defaults-codestyle-0.2.0.ini +.spyproject/config/defaults/defaults-encoding-0.2.0.ini +.spyproject/config/defaults/defaults-vcs-0.2.0.ini +.spyproject/config/defaults/defaults-workspace-0.2.0.ini +.spyproject/config/workspace.ini +.spyproject/config/vcs.ini +.spyproject/config/encoding.ini diff --git a/alignUMIquant.py b/alignUMIquant.py index b881e7e..1289164 100644 --- a/alignUMIquant.py +++ b/alignUMIquant.py @@ -62,8 +62,12 @@ def runSTAR(reads1, reads2, nthreads, STARindex, samplename): def runDedup(samplename, nthreads): STARbam = os.path.join(os.getcwd(), 'STAR', '{0}Aligned.sortedByCoord.out.bam'.format(samplename)) dedupbam = os.path.join(os.getcwd(), 'STAR', '{0}.dedup.bam'.format(samplename)) - command = ['umi_tools', 'dedup', '-I', STARbam, '--paired', ' --output-stats=deduplicated', '-S', dedupbam] - + if args.libType == "LEXO": + command = ['umi_tools', 'dedup', '-I', STARbam, '--paired', '-S', dedupbam] + elif args.libType == "SA": + command = ['umi_tools', 'dedup', '-I', STARbam, '--paired', '--method=unique', '-S', dedupbam] + else: + print('LibType must be either "LEXO" or "SA".') print('Running deduplication for {0}...'.format(samplename)) subprocess.run(command) @@ -189,12 +193,12 @@ def addMD(samplename, reffasta, nthreads): parser.add_argument('--salmonindex', type = str, help = 'Salmon
index directory.') parser.add_argument('--samplename', type = str, help = 'Sample name. Will be appended to output files.') parser.add_argument('--dedupUMI', action = 'store_true', help = 'Deduplicate UMIs? requires UMI extract.') + parser.add_argument('--libType', type = str, help = 'Library Type, either "LEXO" or "SA"') args = parser.parse_args() r1 = os.path.abspath(args.forwardreads) r2 = os.path.abspath(args.reversereads) STARindex = os.path.abspath(args.STARindex) - salmonindex = os.path.abspath(args.salmonindex) samplename = args.samplename nthreads = args.nthreads @@ -205,20 +209,22 @@ def addMD(samplename, reffasta, nthreads): os.mkdir(sampledir) os.chdir(sampledir) - #uniquely aligning read files - if args.dedupUMI: - salmonR1 = samplename + '.dedup.r1.fq.gz' - salmonR2 = samplename + '.dedup.r2.fq.gz' - else: - salmonR1 = samplename + '.STARaligned.r1.fq.gz' - salmonR2 = samplename + '.STARaligned.r2.fq.gz' runSTAR(r1, r2, nthreads, STARindex, samplename) if args.dedupUMI: runDedup(samplename, nthreads) - bamtofastq(samplename, nthreads, args.dedupUMI) - runSalmon(salmonR1, salmonR2, nthreads, salmonindex, samplename) - os.chdir(sampledir) - runPostmaster(samplename, nthreads) - + if args.libType == "LEXO": + salmonindex = os.path.abspath(args.salmonindex) + #uniquely aligning or deduplicated read files + if args.dedupUMI: + salmonR1 = samplename + '.dedup.r1.fq.gz' + salmonR2 = samplename + '.dedup.r2.fq.gz' + else: + salmonR1 = samplename + '.STARaligned.r1.fq.gz' + salmonR2 = samplename + '.STARaligned.r2.fq.gz' + + bamtofastq(samplename, nthreads, args.dedupUMI) + runSalmon(salmonR1, salmonR2, nthreads, salmonindex, samplename) + os.chdir(sampledir) + runPostmaster(samplename, nthreads) \ No newline at end of file diff --git a/pigpen.py b/pigpen.py index f4b381a..7f865ec 100644 --- a/pigpen.py +++ b/pigpen.py @@ -57,7 +57,7 @@ samplebams = [] controlsamplebams = [] for ind, x in enumerate(samplenames): - if args.dedupUMI: + if args.dedupUMI: 
samplebams.append(dedupbams[ind]) if args.controlsamples and x in controlsamples: controlsamplebams.append(dedupbams[ind]) @@ -169,7 +169,7 @@ print('Running PIGPEN for {0}...'.format(sample)) - samplebam = samplebam[ind] + samplebam = samplebams[ind] sampleparams['samplebam'] = os.path.abspath(samplebam) if args.nproc == 1: convs, readcounter = iteratereads_pairedend( @@ -185,4 +185,4 @@ if not os.path.exists(args.outputDir): os.mkdir(args.outputDir) outputfile = os.path.join(args.outputDir, sample + '.pigpen.txt') - writeConvsPerGene(sampleparams, numreadspergene, convsPerGene, outputfile, args.use_g_t, args.use_g_c) + writeConvsPerGene(sampleparams, numreadspergene, convsPerGene, outputfile, args.use_g_t, args.use_g_c) \ No newline at end of file From e30dc817690465dcdba1ed1ad6af6e23d2d08814 Mon Sep 17 00:00:00 2001 From: goeringr Date: Thu, 16 Feb 2023 08:55:08 -0700 Subject: [PATCH 050/108] SnakeMakeWorkflow1.0 Very Helpful pipeliner --- .gitignore | 4 + workflow/config.yaml | 41 ++++++++++ workflow/rules/B2FQ.smk | 48 ++++++++++++ workflow/rules/STARalign.smk | 29 +++++++ workflow/rules/UMIdedup.smk | 26 +++++++ workflow/rules/UMIextract.smk | 32 ++++++++ workflow/rules/runBacon.smk | 27 +++++++ workflow/rules/runPM.smk | 24 ++++++ workflow/rules/runPigpen.smk | 138 ++++++++++++++++++++++++++++++++++ workflow/rules/runSalmon.smk | 47 ++++++++++++ workflow/rules/trimming.smk | 101 +++++++++++++++++++++++++ workflow/snakefile | 17 +++++ 12 files changed, 534 insertions(+) create mode 100755 workflow/config.yaml create mode 100644 workflow/rules/B2FQ.smk create mode 100644 workflow/rules/STARalign.smk create mode 100644 workflow/rules/UMIdedup.smk create mode 100644 workflow/rules/UMIextract.smk create mode 100644 workflow/rules/runBacon.smk create mode 100644 workflow/rules/runPM.smk create mode 100644 workflow/rules/runPigpen.smk create mode 100644 workflow/rules/runSalmon.smk create mode 100644 workflow/rules/trimming.smk create mode 100644 workflow/snakefile 
diff --git a/.gitignore b/.gitignore index b6bf794..92607ee 100644 --- a/.gitignore +++ b/.gitignore @@ -8,3 +8,7 @@ .spyproject/config/workspace.ini .spyproject/config/vcs.ini .spyproject/config/encoding.ini +.spyproject/config/backups/codestyle.ini.bak +.spyproject/config/backups/encoding.ini.bak +.spyproject/config/backups/vcs.ini.bak +.spyproject/config/backups/workspace.ini.bak diff --git a/workflow/config.yaml b/workflow/config.yaml new file mode 100755 index 0000000..a630652 --- /dev/null +++ b/workflow/config.yaml @@ -0,0 +1,41 @@ +samples: + ["ATP5MC1_Rep1_pDBF","ATP5MC1_Rep1_mDBF"] + +libtype: "LEXO" # "SA" or "LEXO" +dedupUMI: True # True or False +threads: 32 # Number of threads +STARindex: "/beevol/home/goeringr/Projects/OINC_seq/SSIV_tx_compare/LEXO/STARindex" +Salmonindex: "/beevol/home/goeringr/Projects/OINC_seq/SSIV_tx_compare/LEXO/transcripts32.idx" + +pigpen: + # path to files + pigpen: "/beevol/home/goeringr/Projects/OINC_seq/test/py2test/pigpen.py" + gff: "/beevol/home/goeringr/Annotations/hg38/gencode.v32.annotation.gff3.gz" + genomeFASTA: "/beevol/home/goeringr/Annotations/hg38/GRCh38.p13.genome.fa" + snpfile: "" + maskbed: "" + ROIbed: "" + # comma separated string + controlsamples: "ATP5MC1_Rep1_mDBF" + # parameter values + SNPcoverage: "20" + SNPfreq: "0.2" + nconv: "1" + minMappingQual: "60" + # output directory name + outputDir: "PIGPEN" + # space separated string + tags: "--useSNPs --use_g_t --use_g_c --onlyConsiderOverlap --use_read1 --use_read2 --dedupUMI" + +bacon: + # path to files + bacon: "/beevol/home/goeringr/Projects/OINC_seq/test/py2test/bacon_glm.py" + sampconds: + ["/beevol/home/goeringr/Projects/OINC_seq/OINC_MAVS/bacon_MAVS.txt"] + minreads: "" + conditionA: "mDBF" + conditionB: "pDBF" + tags: "--use_g_t --use_g_c --considernonG" + output: + ["MAVS.bacon.txt"] + diff --git a/workflow/rules/B2FQ.smk b/workflow/rules/B2FQ.smk new file mode 100644 index 0000000..f2788b9 --- /dev/null +++ b/workflow/rules/B2FQ.smk @@ -0,0 
+1,48 @@ +""" +Rules for writing bam to paired fq files +For usage, include this in your workflow. +""" + + +if not config["dedupUMI"] and config["libtype"]=="LEXO": + rule STARbam2FQ: + """converts STAR aligned bam to paired fq""" + input: + bam=expand("PIGPEN_alignments/{sample}/STAR/{sample}Aligned.sortedByCoord.out.bam", sample=config["samples"]), + bai=expand("PIGPEN_alignments/{sample}/STAR/{sample}Aligned.sortedByCoord.out.bam.bai", sample=config["samples"]), + output: + sortedbam=temp(expand("PIGPEN_alignments/{sample}/STAR/{sample}.temp.namesort.bam", sample=config["samples"])), + fq1=temp(expand("PIGPEN_alignments/{sample}/STAR/{sample}.STARaligned.r1.fq.gz", sample=config["samples"])), + fq2=temp(expand("PIGPEN_alignments/{sample}/STAR/{sample}.STARaligned.r2.fq.gz", sample=config["samples"])), + threads: config["threads"] + run: + for bam,sortedbam,fq1,fq2 in zip(input.bam,output.sortedbam,output.fq1,output.fq2): + shell( + "samtools collate --threads {threads} -u -o {sortedbam} {bam}" + ) + shell( + "samtools fastq --threads {threads} -1 {fq1} -2 {fq2} " + "-0 /dev/null -s /dev/null -n {sortedbam}" + ) + +elif config["dedupUMI"] and config["libtype"]=="LEXO": + rule Dedupbam2FQ: + """converts deduplicated STAR bam to paired fq""" + input: + bam=expand("PIGPEN_alignments/{sample}/STAR/{sample}.dedup.bam", sample=config["samples"]), + bai=expand("PIGPEN_alignments/{sample}/STAR/{sample}.dedup.bam.bai", sample=config["samples"]), + output: + sortedbam=temp(expand("PIGPEN_alignments/{sample}/STAR/{sample}.temp.namesort.bam", sample=config["samples"])), + fq1=temp(expand("PIGPEN_alignments/{sample}/STAR/{sample}.dedup.r1.fq.gz", sample=config["samples"])), + fq2=temp(expand("PIGPEN_alignments/{sample}/STAR/{sample}.dedup.r2.fq.gz", sample=config["samples"])), + threads: config["threads"] + run: + for bam,sortedbam,fq1,fq2 in zip(input.bam,output.sortedbam,output.fq1,output.fq2): + shell( + "samtools collate --threads {threads} -u -o {sortedbam} {bam}" + ) + 
shell( + "samtools fastq --threads {threads} -1 {fq1} -2 {fq2} " + "-0 /dev/null -s /dev/null -n {sortedbam}" + ) + diff --git a/workflow/rules/STARalign.smk b/workflow/rules/STARalign.smk new file mode 100644 index 0000000..dab1ddb --- /dev/null +++ b/workflow/rules/STARalign.smk @@ -0,0 +1,29 @@ +""" +Rules for genome alignment with STAR +For usage, include this in your workflow. +""" + +rule runSTAR: + """aligns trimmed reads to genome with STAR""" + input: + read1=expand("cutadapt/{sample}_1.trimmed.fq.gz", sample=config["samples"]), + read2=expand("cutadapt/{sample}_2.trimmed.fq.gz", sample=config["samples"]), + STARindex=config["STARindex"], + params: + outprefix=expand("PIGPEN_alignments/{sample}/STAR/{sample}", sample=config["samples"]), + output: + bam=expand("PIGPEN_alignments/{sample}/STAR/{sample}Aligned.sortedByCoord.out.bam", sample=config["samples"]), + bai=expand("PIGPEN_alignments/{sample}/STAR/{sample}Aligned.sortedByCoord.out.bam.bai", sample=config["samples"]), + threads: config["threads"] + run: + for r1,r2,prefix,bam in zip(input.read1,input.read2,params.outprefix,output.bam): + shell( + "STAR --runMode alignReads --runThreadN {threads} " + "--genomeLoad NoSharedMemory --genomeDir {input.STARindex} " + "--readFilesIn {r1} {r2} --readFilesCommand zcat " + "--outFileNamePrefix {prefix} --outSAMtype BAM " + "SortedByCoordinate --outSAMstrandField intronMotif " + "--outSAMattributes MD NH --outSAMmultNmax 1" + ) + shell("samtools index {bam}") + diff --git a/workflow/rules/UMIdedup.smk b/workflow/rules/UMIdedup.smk new file mode 100644 index 0000000..9399551 --- /dev/null +++ b/workflow/rules/UMIdedup.smk @@ -0,0 +1,26 @@ +""" +Rules for UMI deduplication with UMI-tools +For usage, include this in your workflow. 
+""" + +if config["dedupUMI"]: + rule dedupSTAR: + """deduplicates UMIs in STAR-aligned bams with UMI-tools""" + input: + bam=expand("PIGPEN_alignments/{sample}/STAR/{sample}Aligned.sortedByCoord.out.bam", sample=config["samples"]), + bai=expand("PIGPEN_alignments/{sample}/STAR/{sample}Aligned.sortedByCoord.out.bam.bai", sample=config["samples"]), + + output: + bam=expand("PIGPEN_alignments/{sample}/STAR/{sample}.dedup.bam", sample=config["samples"]), + bai=expand("PIGPEN_alignments/{sample}/STAR/{sample}.dedup.bam.bai", sample=config["samples"]), + run: + for Inbam,Outbam in zip(input.bam,output.bam): + if config["libtype"]=="LEXO": + shell("umi_tools dedup -I {Inbam} --paired -S {Outbam}") + shell("samtools index {Outbam}") + elif config["libtype"]=="SA": + shell("umi_tools dedup -I {Inbam} --paired --method=unique -S {Outbam}") + shell("samtools index {Outbam}") + else: + print('LibType must be either "LEXO" or "SA".') + diff --git a/workflow/rules/UMIextract.smk b/workflow/rules/UMIextract.smk new file mode 100644 index 0000000..30e4914 --- /dev/null +++ b/workflow/rules/UMIextract.smk @@ -0,0 +1,32 @@ +""" +Rules for extraction of UMIs with UMI_tools +For usage, include this in your workflow.
+""" + +rule UMIextract: + """extracts UMIs from raw fastq files""" + input: + read1=expand("RAWREADS/{sample}.R1.fq.gz", sample=config["samples"]), + read2=expand("RAWREADS/{sample}.R2.fq.gz", sample=config["samples"]), + output: + read1=expand("UMI_fastq/{sample}_R1.fq.gz", sample=config["samples"]), + read2=expand("UMI_fastq/{sample}_R2.fq.gz", sample=config["samples"]), + run: + if not config["libtype"]: + print("libtype must be included in the config file") + elif config["libtype"]: + for r1,r2,o1,o2 in zip(input.read1,input.read2,output.read1,output.read2): + if config["libtype"]=="LEXO": + shell( + "umi_tools extract -I {r1} " + "--bc-pattern=NNNNNN --read2-in={r2} " + "--stdout={o1} --read2-out={o2}" + ) + elif config["libtype"]=="SA": + shell( + "umi_tools extract -I {r2} " + "--bc-pattern=NNNNNNNNNNNN --read2-in={r1} " + "--stdout={o2} --read2-out={o1}" + ) + else: + print("libtype must be either 'LEXO' or 'SA'") diff --git a/workflow/rules/runBacon.smk b/workflow/rules/runBacon.smk new file mode 100644 index 0000000..ebef2a2 --- /dev/null +++ b/workflow/rules/runBacon.smk @@ -0,0 +1,27 @@ +""" +Rules for running bacon.py +For usage, include this in your workflow. 
+""" + +rule runBacon: + """runs Bacon with desired sample comparisons""" + input: + bacon=config["bacon"]["bacon"], + pigpen=expand("PIGPEN_alignments/{outDir}/{sample}.pigpen.txt", sample=config["samples"], outDir =config["pigpen"]["outputDir"]), + params: + SC=expand("{sampconds}", sampconds=config["bacon"]["sampconds"]), + MR=" --minreads " + config["bacon"]["minreads"] if config["bacon"]["minreads"] else "", + CA=" --conditionA " + config["bacon"]["conditionA"] if config["bacon"]["conditionA"] else "", + CB=" --conditionB " + config["bacon"]["conditionB"] if config["bacon"]["conditionB"] else "", + OT=" " + config["bacon"]["tags"] if config["bacon"]["tags"] else "", + OD=expand("PIGPEN_alignments/{outDir}/{output}.bacon.txt", output=config["bacon"]["output"], outDir =config["pigpen"]["outputDir"]), + output: + expand("PIGPEN_alignments/{outDir}/{output}.bacon.txt", output=config["bacon"]["output"], outDir =config["pigpen"]["outputDir"]), + threads: config["threads"] + run: + for sampcond,output in zip(params.SC,params.OD): + shell( + "python -u {input.bacon} --sampconds {sampcond} " + "--output {output}{params.MR}{params.CA}{params.CB}{params.OT}" + ) + diff --git a/workflow/rules/runPM.smk b/workflow/rules/runPM.smk new file mode 100644 index 0000000..5f922c5 --- /dev/null +++ b/workflow/rules/runPM.smk @@ -0,0 +1,24 @@ +""" +Rules for alignment probability extraction with postmaster +For usage, include this in your workflow.
+""" + +if config["libtype"]=="LEXO": + rule runPostmaster: + """extracts alignment probabilities from salmon bams and quantification""" + input: + sf=expand("PIGPEN_alignments/{sample}/salmon/{sample}.quant.sf", sample=config["samples"]), + bam=expand("PIGPEN_alignments/{sample}/salmon/{sample}.salmon.bam", sample=config["samples"]), + output: + bam=expand("PIGPEN_alignments/{sample}/postmaster/{sample}.postmaster.bam", sample=config["samples"]), + bai=expand("PIGPEN_alignments/{sample}/postmaster/{sample}.postmaster.bam.bai", sample=config["samples"]), + threads: config["threads"] + run: + for sf,Sbam,PMbam in zip(input.sf,input.bam,output.bam): + shell( + "postmaster --num-threads {threads} --quant {sf} " + "--alignments {Sbam} --output {PMbam}" + ) + shell("samtools sort -@ {threads} -o {PMbam} {PMbam}") + shell("samtools index {PMbam}") + diff --git a/workflow/rules/runPigpen.smk b/workflow/rules/runPigpen.smk new file mode 100644 index 0000000..9b18564 --- /dev/null +++ b/workflow/rules/runPigpen.smk @@ -0,0 +1,138 @@ +""" +Rules for running pigpen.py +For usage, include this in your workflow.
+""" + +if config["libtype"]=="LEXO" and config["dedupUMI"]: + rule runPigpen_LEXO_UMI: + """runs Pigpen with desired parameters""" + input: + pigpen=config["pigpen"]["pigpen"], + dedupbam=expand("PIGPEN_alignments/{sample}/STAR/{sample}.dedup.bam", sample=config["samples"]), + dedupbai=expand("PIGPEN_alignments/{sample}/STAR/{sample}.dedup.bam.bai", sample=config["samples"]), + PMbam=expand("PIGPEN_alignments/{sample}/postmaster/{sample}.postmaster.bam", sample=config["samples"]), + PMbai=expand("PIGPEN_alignments/{sample}/postmaster/{sample}.postmaster.bam.bai", sample=config["samples"]), + params: + samples=",".join(config["samples"]), + gff=" --gff " + config["pigpen"]["gff"] if config["pigpen"]["gff"] else "", + fa=" --genomeFasta " + config["pigpen"]["genomeFASTA"] if config["pigpen"]["genomeFASTA"] else "", + CS=" --controlsamples " + config["pigpen"]["controlsamples"] if config["pigpen"]["controlsamples"] else "", + VCF=" --snpfile " + config["pigpen"]["snpfile"] if config["pigpen"]["snpfile"] else "", + MB=" --maskbed " + config["pigpen"]["maskbed"] if config["pigpen"]["maskbed"] else "", + RB=" --ROIbed " + config["pigpen"]["ROIbed"] if config["pigpen"]["ROIbed"] else "", + SC=" --SNPcoverage " + config["pigpen"]["SNPcoverage"] if config["pigpen"]["SNPcoverage"] else "", + SF=" --SNPfreq " + config["pigpen"]["SNPfreq"] if config["pigpen"]["SNPfreq"] else "", + NC=" --nConv " + config["pigpen"]["nconv"] if config["pigpen"]["nconv"] else "", + MQ=" --minMappingQual " + config["pigpen"]["minMappingQual"] if config["pigpen"]["minMappingQual"] else "", + OT=" " + config["pigpen"]["tags"] if config["pigpen"]["tags"] else "", + OD=" --outputDir " + config["pigpen"]["outputDir"] if config["pigpen"]["outputDir"] else "", + output: + expand("PIGPEN_alignments/{outDir}/{sample}.pigpen.txt", sample=config["samples"], outDir =config["pigpen"]["outputDir"]), + threads: config["threads"] + run: + os.chdir('PIGPEN_alignments') + shell( + "python -u {input.pigpen}
--samplenames {params.samples} --nproc {threads}" + "{params.gff}{params.fa}{params.CS}{params.VCF}{params.MB}{params.RB}" + "{params.SC}{params.SF}{params.NC}{params.MQ}{params.OT}{params.OD}" + ) + +elif config["libtype"]=="LEXO" and not config["dedupUMI"]: + rule runPigpen_LEXO: + """runs Pigpen with desired parameters""" + input: + pigpen=config["pigpen"]["pigpen"], + STARbam=expand("PIGPEN_alignments/{sample}/STAR/{sample}Aligned.sortedByCoord.out.bam", sample=config["samples"]), + STARbai=expand("PIGPEN_alignments/{sample}/STAR/{sample}Aligned.sortedByCoord.out.bam.bai", sample=config["samples"]), + PMbam=expand("PIGPEN_alignments/{sample}/postmaster/{sample}.postmaster.bam", sample=config["samples"]), + PMbai=expand("PIGPEN_alignments/{sample}/postmaster/{sample}.postmaster.bam.bai", sample=config["samples"]), + params: + samples=",".join(config["samples"]), + gff=" --gff " + config["pigpen"]["gff"] if config["pigpen"]["gff"] else "", + fa=" --genomeFasta " + config["pigpen"]["genomeFASTA"] if config["pigpen"]["genomeFASTA"] else "", + CS=" --controlsamples " + config["pigpen"]["controlsamples"] if config["pigpen"]["controlsamples"] else "", + VCF=" --snpfile " + config["pigpen"]["snpfile"] if config["pigpen"]["snpfile"] else "", + MB=" --maskbed " + config["pigpen"]["maskbed"] if config["pigpen"]["maskbed"] else "", + RB=" --ROIbed " + config["pigpen"]["ROIbed"] if config["pigpen"]["ROIbed"] else "", + SC=" --SNPcoverage " + config["pigpen"]["SNPcoverage"] if config["pigpen"]["SNPcoverage"] else "", + SF=" --SNPfreq " + config["pigpen"]["SNPfreq"] if config["pigpen"]["SNPfreq"] else "", + NC=" --nConv " + config["pigpen"]["nconv"] if config["pigpen"]["nconv"] else "", + MQ=" --minMappingQual " + config["pigpen"]["minMappingQual"] if config["pigpen"]["minMappingQual"] else "", + OT=" " + config["pigpen"]["tags"] if config["pigpen"]["tags"] else "", + OD=" --outputDir " + config["pigpen"]["outputDir"] if config["pigpen"]["outputDir"] else "", + output: + 
expand("PIGPEN_alignments/{outDir}/{sample}.pigpen.txt", sample=config["samples"], outDir =config["pigpen"]["outputDir"]), + threads: config["threads"] + run: + os.chdir('PIGPEN_alignments') + shell( + "python -u {input.pigpen} --samplenames {params.samples} --nproc {threads}" + "{params.gff}{params.fa}{params.CS}{params.VCF}{params.MB}{params.RB}" + "{params.SC}{params.SF}{params.NC}{params.MQ}{params.OT}{params.OD}" + ) + +elif config["libtype"]=="SA" and config["dedupUMI"]: + rule runPigpen_SA_UMI: + """runs Pigpen with desired parameters""" + input: + pigpen=config["pigpen"]["pigpen"], + dedupbam=expand("PIGPEN_alignments/{sample}/STAR/{sample}.dedup.bam", sample=config["samples"]), + dedupbai=expand("PIGPEN_alignments/{sample}/STAR/{sample}.dedup.bam.bai", sample=config["samples"]), + params: + samples=",".join(config["samples"]), + gff=" --gff " + config["pigpen"]["gff"] if config["pigpen"]["gff"] else "", + fa=" --genomeFasta " + config["pigpen"]["genomeFASTA"] if config["pigpen"]["genomeFASTA"] else "", + CS=" --controlsamples " + config["pigpen"]["controlsamples"] if config["pigpen"]["controlsamples"] else "", + VCF=" --snpfile " + config["pigpen"]["snpfile"] if config["pigpen"]["snpfile"] else "", + MB=" --maskbed " + config["pigpen"]["maskbed"] if config["pigpen"]["maskbed"] else "", + RB=" --ROIbed " + config["pigpen"]["ROIbed"] if config["pigpen"]["ROIbed"] else "", + SC=" --SNPcoverage " + config["pigpen"]["SNPcoverage"] if config["pigpen"]["SNPcoverage"] else "", + SF=" --SNPfreq " + config["pigpen"]["SNPfreq"] if config["pigpen"]["SNPfreq"] else "", + NC=" --nConv " + config["pigpen"]["nconv"] if config["pigpen"]["nconv"] else "", + MQ=" --minMappingQual " + config["pigpen"]["minMappingQual"] if config["pigpen"]["minMappingQual"] else "", + OT=" " + config["pigpen"]["tags"] if config["pigpen"]["tags"] else "", + OD=" --outputDir " + config["pigpen"]["outputDir"] if config["pigpen"]["outputDir"] else "", + output: + 
expand("PIGPEN_alignments/{outDir}/{sample}.pigpen.txt", sample=config["samples"], outDir =config["pigpen"]["outputDir"]), + threads: config["threads"] + run: + os.chdir('PIGPEN_alignments') + shell( + "python -u {input.pigpen} --samplenames {params.samples} --nproc {threads}" + "{params.gff}{params.fa}{params.CS}{params.VCF}{params.MB}{params.RB}" + "{params.SC}{params.SF}{params.NC}{params.MQ}{params.OT}{params.OD}" + ) + +elif config["libtype"]=="SA" and not config["dedupUMI"]: + rule runPigpen_SA: + """runs Pigpen with desired parameters""" + input: + pigpen=config["pigpen"]["pigpen"], + STARbam=expand("PIGPEN_alignments/{sample}/STAR/{sample}Aligned.sortedByCoord.out.bam", sample=config["samples"]), + STARbai=expand("PIGPEN_alignments/{sample}/STAR/{sample}Aligned.sortedByCoord.out.bam.bai", sample=config["samples"]), + params: + samples=",".join(config["samples"]), + gff=" --gff " + config["pigpen"]["gff"] if config["pigpen"]["gff"] else "", + fa=" --genomeFasta " + config["pigpen"]["genomeFASTA"] if config["pigpen"]["genomeFASTA"] else "", + CS=" --controlsamples " + config["pigpen"]["controlsamples"] if config["pigpen"]["controlsamples"] else "", + VCF=" --snpfile " + config["pigpen"]["snpfile"] if config["pigpen"]["snpfile"] else "", + MB=" --maskbed " + config["pigpen"]["maskbed"] if config["pigpen"]["maskbed"] else "", + RB=" --ROIbed " + config["pigpen"]["ROIbed"] if config["pigpen"]["ROIbed"] else "", + SC=" --SNPcoverage " + config["pigpen"]["SNPcoverage"] if config["pigpen"]["SNPcoverage"] else "", + SF=" --SNPfreq " + config["pigpen"]["SNPfreq"] if config["pigpen"]["SNPfreq"] else "", + NC=" --nConv " + config["pigpen"]["nconv"] if config["pigpen"]["nconv"] else "", + MQ=" --minMappingQual " + config["pigpen"]["minMappingQual"] if config["pigpen"]["minMappingQual"] else "", + OT=" " + config["pigpen"]["tags"] if config["pigpen"]["tags"] else "", + OD=" --outputDir " + config["pigpen"]["outputDir"] if config["pigpen"]["outputDir"] else "", + output: 
 expand("PIGPEN_alignments/{outDir}/{sample}.pigpen.txt", sample=config["samples"], outDir =config["pigpen"]["outputDir"]), + threads: config["threads"] + run: + os.chdir('PIGPEN_alignments') + shell( + "python -u {input.pigpen} --samplenames {params.samples} --nproc {threads}" + "{params.gff}{params.fa}{params.CS}{params.VCF}{params.MB}{params.RB}" + "{params.SC}{params.SF}{params.NC}{params.MQ}{params.OT}{params.OD}" + ) + + diff --git a/workflow/rules/runSalmon.smk b/workflow/rules/runSalmon.smk new file mode 100644 index 0000000..63ffb1b --- /dev/null +++ b/workflow/rules/runSalmon.smk @@ -0,0 +1,48 @@ +""" +Rules for transcriptome alignment with Salmon +For usage, include this in your workflow. +""" + +if not config["dedupUMI"] and config["libtype"]=="LEXO": + rule runSalmon: + """aligns trimmed reads to transcriptome with Salmon""" + input: + read1=expand("PIGPEN_alignments/{sample}/STAR/{sample}.STARaligned.r1.fq.gz", sample=config["samples"]), + read2=expand("PIGPEN_alignments/{sample}/STAR/{sample}.STARaligned.r2.fq.gz", sample=config["samples"]), + Salmonindex=config["Salmonindex"], + params: + samplename=expand("PIGPEN_alignments/{sample}/salmon", sample=config["samples"]), + output: + sf=expand("PIGPEN_alignments/{sample}/salmon/{sample}.quant.sf", sample=config["samples"]), + bam=temp(expand("PIGPEN_alignments/{sample}/salmon/{sample}.salmon.bam", sample=config["samples"])), + threads: config["threads"] + run: + for r1,r2,samplename,sf,bam in zip(input.read1,input.read2,params.samplename,output.sf,output.bam): + shell( + "salmon quant --libType A -p {threads} --seqBias --gcBias " + "--validateMappings -1 {r1} -2 {r2} -o {samplename} " + "--index {input.Salmonindex} --writeMappings={bam} --writeQualities" + ) + shell("mv {samplename}/quant.sf {sf}") + +elif config["dedupUMI"] and config["libtype"]=="LEXO": + rule runSalmonDedup: + """aligns trimmed reads to transcriptome with Salmon""" + input: + read1=expand("PIGPEN_alignments/{sample}/STAR/{sample}.dedup.r1.fq.gz", sample=config["samples"]), + 
read2=expand("PIGPEN_alignments/{sample}/STAR/{sample}.dedup.r2.fq.gz", sample=config["samples"]), + Salmonindex=config["Salmonindex"], + params: + samplename=expand("PIGPEN_alignments/{sample}/salmon", sample=config["samples"]), + output: + sf=expand("PIGPEN_alignments/{sample}/salmon/{sample}.quant.sf", sample=config["samples"]), + bam=temp(expand("PIGPEN_alignments/{sample}/salmon/{sample}.salmon.bam", sample=config["samples"])), + threads: config["threads"] + run: + for r1,r2,samplename,sf,bam in zip(input.read1,input.read2,params.samplename,output.sf,output.bam): + shell( + "salmon quant --libType A -p {threads} --seqBias --gcBias " + "--validateMappings -1 {r1} -2 {r2} -o {samplename} " + "--index {input.Salmonindex} --writeMappings={bam} --writeQualities" + ) + shell("mv {samplename}/quant.sf {sf}") diff --git a/workflow/rules/trimming.smk b/workflow/rules/trimming.smk new file mode 100644 index 0000000..cd3b486 --- /dev/null +++ b/workflow/rules/trimming.smk @@ -0,0 +1,101 @@ +""" +Rules for trimming NGS reads with cutadapt +(http://cutadapt.readthedocs.org/en/latest/guide.html#illumina-truseq) +For usage, include this in your workflow. + +Quantseq samples +Read1, trim AAAAAAAAAAAAAAAAAAAA from 3' end +Read2, trim AGATCGGAAGAGCGTCGTGTAGGGAAAGACGGTA from 3' end and TTTTTTTTTTTTTTTTTTTT from 5' end + +NEW QUANTSEQ STRATEGY +Step1: get rid of random hex binding site (first 6 nt of read 1) and trim 3' adapter off read 1 (-u 6 ; -a AAAAAAAAAAAAAAAAAAAA) [make temporary output files] +Step2: Trim 5' adapter of read 2 (TTTTTTTTTTTTTTTTTTTT) [make temporary output files] +Step3: Try to trim 3' adapter of read 2 (AGATCGGAAGAGCGTCGTGTAGGGAAAGACGGTA). Write untrimmed reads (these are done). [make temporary outfile for trimmed reads] +Step4: For reads that did have 3' adapter on read 2, remove the last 6 bases on read 2 (UMI). These are now done too. +Step5: Combine trimmed reads from step4 with untrimmed reads from step3.
+ +SINGLE AMPLICON (SA) STRATEGY +Step1: get rid of primer binding sites (20nt) + internal barcode (6nt) +""" +if not config["libtype"]: + print("libtype must be included in the config file") + +elif config["libtype"]=="LEXO": + rule cutadapt_LEXO: + """Trims given paired-end reads with given parameters""" + input: + read1=expand("UMI_fastq/{sample}_R1.fq.gz", sample=config["samples"]), + read2=expand("UMI_fastq/{sample}_R2.fq.gz", sample=config["samples"]), + output: + outread1s1=temp(expand("cutadapt/{sample}_1.temp.step1.fq.gz", sample=config["samples"])), + outread2s1=temp(expand("cutadapt/{sample}_2.temp.step1.fq.gz", sample=config["samples"])), + statsouts1=expand("cutadapt/{sample}.cutadaptstats.step1.txt", sample=config["samples"]), + outread1s2=temp(expand("cutadapt/{sample}_1.temp.step2.fq.gz", sample=config["samples"])), + outread2s2=temp(expand("cutadapt/{sample}_2.temp.step2.fq.gz", sample=config["samples"])), + statsouts2=expand("cutadapt/{sample}.cutadaptstats.step2.txt", sample=config["samples"]), + outread1s3=temp(expand("cutadapt/{sample}_1.temp.step3.fq.gz", sample=config["samples"])), + outread2s3=temp(expand("cutadapt/{sample}_2.temp.step3.fq.gz", sample=config["samples"])), + statsouts3=expand("cutadapt/{sample}.cutadaptstats.step3.txt", sample=config["samples"]), + untrimmedr1s3=temp(expand("cutadapt/{sample}_1.untrimmed.step3.fq.gz", sample=config["samples"])), + untrimmedr2s3=temp(expand("cutadapt/{sample}_2.untrimmed.step3.fq.gz", sample=config["samples"])), + outread1s4=temp(expand("cutadapt/{sample}_1.temp.step4.fq.gz", sample=config["samples"])), + outread2s4=temp(expand("cutadapt/{sample}_2.temp.step4.fq.gz", sample=config["samples"])), + statsouts4=expand("cutadapt/{sample}.cutadaptstats.step4.txt", sample=config["samples"]), + finaloutread1=expand("cutadapt/{sample}_1.trimmed.fq.gz", sample=config["samples"]), + finaloutread2=expand("cutadapt/{sample}_2.trimmed.fq.gz", sample=config["samples"]), + threads: config["threads"] + run: + 
for Ir1,Ir2,O1r1,O1r2,S1,O2r1,O2r2,S2,O3r1,O3r2,S3,U3r1,U3r2,O4r1,O4r2,S4,OFr1,OFr2 in zip(input.read1,input.read2,output.outread1s1,output.outread2s1,output.statsouts1,output.outread1s2,output.outread2s2,output.statsouts2,output.outread1s3,output.outread2s3,output.statsouts3,output.untrimmedr1s3,output.untrimmedr2s3,output.outread1s4,output.outread2s4,output.statsouts4,output.finaloutread1,output.finaloutread2): + #Step1 + shell( + "cutadapt -u 6 -U 0 -a AAAAAAAAAAAAAAAAAAAA --minimum-length 25 " + "-j {threads} -o {O1r1} -p {O1r2} {Ir1} {Ir2} > {S1}" + ) + + #Step2 + shell( + "cutadapt -G TTTTTTTTTTTTTTTTTTTT --minimum-length 25 " + "-j {threads} -o {O2r1} -p {O2r2} {O1r1} {O1r2} > {S2}" + ) + + #Step3 + shell( + "cutadapt -A AGATCGGAAGAGCGTCGTGTAGGGAAAGACGGTA --minimum-length 25 " + "-j {threads} --untrimmed-output {U3r1} " + "--untrimmed-paired-output {U3r2} -o {O3r1} -p {O3r2} " + "{O2r1} {O2r2} > {S3}" + ) + + #Step4 + shell( + "cutadapt -U -6 --minimum-length 25 -j {threads} -o {O4r1} " + "-p {O4r2} {O3r1} {O3r2} > {S4}" + ) + + #Step5 + shell("cat {U3r1} {O4r1} > {OFr1}") + shell("cat {U3r2} {O4r2} > {OFr2}") + + +elif config["libtype"]=="SA": + rule cutadapt_SA: + """Trims given paired-end reads with given parameters""" + input: + read1=expand("UMI_fastq/{sample}_R1.fq.gz", sample=config["samples"]), + read2=expand("UMI_fastq/{sample}_R2.fq.gz", sample=config["samples"]), + output: + finaloutread1=expand("cutadapt/{sample}_1.trimmed.fq.gz", sample=config["samples"]), + finaloutread2=expand("cutadapt/{sample}_2.trimmed.fq.gz", sample=config["samples"]), + statsouts=expand("cutadapt/{sample}.cutadaptstats.txt", sample=config["samples"]), + threads: config["threads"] + run: + for Ir1,Ir2,S1,OFr1,OFr2 in zip(input.read1,input.read2,output.statsouts,output.finaloutread1,output.finaloutread2): + shell( + "cutadapt -u 26 -U 26 --minimum-length 25 " + "-j {threads} -o {OFr1} -p {OFr2} {Ir1} {Ir2} > {S1}" + ) + + +else: + print("libtype must be either 'LEXO' or 
'SA'") + diff --git a/workflow/snakefile b/workflow/snakefile new file mode 100644 index 0000000..2179294 --- /dev/null +++ b/workflow/snakefile @@ -0,0 +1,17 @@ +configfile: "config/config.yaml" + +include: "rules/UMIextract.smk" +include: "rules/trimming.smk" +include: "rules/STARalign.smk" +include: "rules/UMIdedup.smk" +include: "rules/B2FQ.smk" +include: "rules/runSalmon.smk" +include: "rules/runPM.smk" +include: "rules/runPigpen.smk" +include: "rules/runBacon.smk" + +rule all: + input: + #expand("PIGPEN_alignments/{outDir}/{sample}.pigpen.txt", sample=config["samples"], outDir=config["pigpen"]["outputDir"]), + expand("PIGPEN_alignments/{outDir}/{sample}.bacon.txt", sample=config["bacon"]["output"], outDir=config["pigpen"]["outputDir"]), + From 4f620357875ccff60554516623f584d1ea7eb31e Mon Sep 17 00:00:00 2001 From: Matthew Taliaferro Date: Thu, 2 Mar 2023 09:22:16 -0700 Subject: [PATCH 051/108] readme update --- README.md | 3 +++ 1 file changed, 3 insertions(+) diff --git a/README.md b/README.md index 144c0a1..ff13429 100644 --- a/README.md +++ b/README.md @@ -42,6 +42,9 @@ PIGPEN has the following prerequisites: - bamtools >= 2.5.2 - salmon >= 1.9.0 - gffutils >= 0.11.0 +- umi_tools >= 1.1.0 (if UMI collapsing is desired) +- [postmaster](https://github.com/COMBINE-lab/postmaster) +>Note: postmaster is a [rust](https://www.rust-lang.org/) package. Installing it requires rust (which itself is installable using [conda](https://anaconda.org/conda-forge/rust)). Once rust is installed, use `cargo install --git https://github.com/COMBINE-lab/postmaster` to install postmaster. 
BACON has the following prerequisites: From 16ca1a0dd8063396ab5d296fd3d9cc0241ac221f Mon Sep 17 00:00:00 2001 From: Matthew Taliaferro Date: Thu, 2 Mar 2023 09:47:33 -0700 Subject: [PATCH 052/108] update alignandquants to handle single-end data --- alignAndQuant.py | 48 +++++++++++++++++++++++++++++-------------- alignAndQuant2.py | 34 +++++++++++++++++++++++++-------- 2 files changed, 60 insertions(+), 22 deletions(-) diff --git a/alignAndQuant.py b/alignAndQuant.py index 25df4c2..793a73e 100644 --- a/alignAndQuant.py +++ b/alignAndQuant.py @@ -34,12 +34,17 @@ def runSTAR(reads1, reads2, nthreads, STARindex, samplename): os.mkdir(outdir) prefix = outdir + '/' + samplename - command = ['STAR', '--runMode', 'alignReads', '--runThreadN', nthreads, '--genomeLoad', 'NoSharedMemory', '--genomeDir', STARindex, '--readFilesIn', reads1, reads2, '--readFilesCommand', - 'zcat', '--outFileNamePrefix', prefix, '--outSAMtype', 'BAM', 'SortedByCoordinate', '--outSAMstrandField', 'intronMotif', '-–outFilterMultimapNmax', '1', '--outSAMattributes', 'MD', 'NH'] + if reads2: + command = ['STAR', '--runMode', 'alignReads', '--runThreadN', nthreads, '--genomeLoad', 'NoSharedMemory', '--genomeDir', STARindex, '--readFilesIn', reads1, reads2, '--readFilesCommand', + 'zcat', '--outFileNamePrefix', prefix, '--outSAMtype', 'BAM', 'SortedByCoordinate', '--outSAMstrandField', 'intronMotif', '--outFilterMultimapNmax', '1', '--outSAMattributes', 'MD', 'NH'] + + elif not reads2: + command = ['STAR', '--runMode', 'alignReads', '--runThreadN', nthreads, '--genomeLoad', 'NoSharedMemory', '--genomeDir', STARindex, '--readFilesIn', reads1, '--readFilesCommand', + 'zcat', '--outFileNamePrefix', prefix, '--outSAMtype', 'BAM', 'SortedByCoordinate', '--outSAMstrandField', 'intronMotif', '--outFilterMultimapNmax', '1', '--outSAMattributes', 'MD', 'NH'] print('Running STAR for {0}...'.format(samplename)) - subprocess.call(command) + subprocess.run(command) #make index bam = os.path.join(outdir,
samplename + 'Aligned.sortedByCoord.out.bam') @@ -52,7 +57,7 @@ def runSTAR(reads1, reads2, nthreads, STARindex, samplename): print('Finished STAR for {0}!'.format(samplename)) -def bamtofastq(samplename, nthreads): +def bamtofastq(samplename, nthreads, reads2): #Given a bam file of uniquely aligned reads (produced from runSTAR), rederive these reads as fastq in preparation for submission to salmon if not os.path.exists('STAR'): os.mkdir('STAR') @@ -72,7 +77,10 @@ def bamtofastq(samplename, nthreads): r1file = samplename + '.unique.r1.fq.gz' r2file = samplename + '.unique.r2.fq.gz' print('Writing fastq file of uniquely aligned reads for {0}...'.format(samplename)) - command = ['samtools', 'fastq', '--threads', nthreads, '-1', r1file, '-2', r2file, '-0', '/dev/null', '-s', '/dev/null', '-n', sortedbam] + if reads2: + command = ['samtools', 'fastq', '--threads', nthreads, '-1', r1file, '-2', r2file, '-0', '/dev/null', '-s', '/dev/null', '-n', sortedbam] + elif not reads2: + command = ['samtools', 'fastq', '--threads', nthreads, '-0', r1file, '-n', sortedbam] subprocess.call(command) print('Done writing fastq files for {0}!'.format(samplename)) @@ -87,16 +95,21 @@ def runSalmon(reads1, reads2, nthreads, salmonindex, samplename): idx = os.path.abspath(salmonindex) r1 = os.path.abspath(reads1) - r2 = os.path.abspath(reads2) + if reads2: + r2 = os.path.abspath(reads2) os.chdir('salmon') - command = ['salmon', 'quant', '--libType', 'A', '-p', nthreads, '--seqBias', '--gcBias', + if reads2: + command = ['salmon', 'quant', '--libType', 'A', '-p', nthreads, '--seqBias', '--gcBias', '--validateMappings', '-1', r1, '-2', r2, '-o', samplename, '--index', idx, '--writeMappings={0}.salmon.bam'.format(samplename), '--writeQualities'] + elif not reads2: + command = ['salmon', 'quant', '--libType', 'A', '-p', nthreads, '--seqBias', + '--validateMappings', '-r', r1, '-o', samplename, '--index', idx, '--writeMappings={0}.salmon.bam'.format(samplename), '--writeQualities'] 
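The paired/single-end branching in `runSalmon()` above reduces to choosing the read flags: paired-end reads go in via `-1`/`-2` with `--gcBias`, while single-end reads use `-r` without `--gcBias` (fragment GC content cannot be inferred from one read). A condensed, hypothetical helper showing the two command shapes:

```python
def salmon_command(r1, nthreads, idx, samplename, r2=None):
    # Sketch of the branching in runSalmon(): the only differences between the
    # paired- and single-end invocations are the read flags and --gcBias.
    cmd = ['salmon', 'quant', '--libType', 'A', '-p', str(nthreads), '--seqBias']
    if r2:
        cmd += ['--gcBias', '--validateMappings', '-1', r1, '-2', r2]
    else:
        cmd += ['--validateMappings', '-r', r1]
    return cmd + ['-o', samplename, '--index', idx,
                  '--writeMappings={0}.salmon.bam'.format(samplename), '--writeQualities']
```

Passing the resulting list to `subprocess.run()` avoids shell quoting issues, which is why the scripts build commands as lists rather than strings.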
print('Running salmon for {0}...'.format(samplename)) - subprocess.call(command) + subprocess.run(command) #Move output outputdir = os.path.join(os.getcwd(), samplename) @@ -106,7 +119,8 @@ def runSalmon(reads1, reads2, nthreads, salmonindex, samplename): #Remove uniquely aligning read files os.remove(r1) - os.remove(r2) + if reads2: + os.remove(r2) print('Finished salmon for {0}!'.format(samplename)) @@ -132,7 +146,7 @@ def runPostmaster(samplename, nthreads): os.rename(outputfile + '.sort', outputfile) command = ['samtools', 'index', outputfile] - subprocess.call(command) + subprocess.run(command) #We don't need the salmon alignment file anymore, and it's pretty big os.remove(salmonbam) @@ -151,7 +165,7 @@ def addMD(samplename, reffasta, nthreads): if __name__ == '__main__': parser = argparse.ArgumentParser(description = 'Align and quantify reads using STAR, salmon, and postmaster in preparation for analysis with PIGPEN.') parser.add_argument('--forwardreads', type = str, help = 'Forward reads. Gzipped fastq.') - parser.add_argument('--reversereads', type = str, help = 'Reverse reads. Gzipped fastq.') + parser.add_argument('--reversereads', type = str, help = 'Reverse reads. Gzipped fastq. 
Do not supply if using single end reads.') parser.add_argument('--nthreads', type = str, help = 'Number of threads to use for alignment and quantification.') parser.add_argument('--STARindex', type = str, help = 'STAR index directory.') parser.add_argument('--salmonindex', type = str, help = 'Salmon index directory.') @@ -159,7 +173,10 @@ def addMD(samplename, reffasta, nthreads): args = parser.parse_args() r1 = os.path.abspath(args.forwardreads) - r2 = os.path.abspath(args.reversereads) + if args.reversereads: + r2 = os.path.abspath(args.reversereads) + elif not args.reversereads: + r2 = None STARindex = os.path.abspath(args.STARindex) salmonindex = os.path.abspath(args.salmonindex) samplename = args.samplename @@ -174,10 +191,13 @@ def addMD(samplename, reffasta, nthreads): #uniquely aligning read files uniquer1 = samplename + '.unique.r1.fq.gz' - uniquer2 = samplename + '.unique.r2.fq.gz' + if args.reversereads: + uniquer2 = samplename + '.unique.r2.fq.gz' + elif not args.reversereads: + uniquer2 = None runSTAR(r1, r2, nthreads, STARindex, samplename) - bamtofastq(samplename, nthreads) + bamtofastq(samplename, nthreads, r2) runSalmon(uniquer1, uniquer2, nthreads, salmonindex, samplename) os.chdir(sampledir) runPostmaster(samplename, nthreads) diff --git a/alignAndQuant2.py b/alignAndQuant2.py index 5fca1cb..a2c2337 100644 --- a/alignAndQuant2.py +++ b/alignAndQuant2.py @@ -39,8 +39,14 @@ def runSTAR(reads1, reads2, nthreads, STARindex, samplename): os.mkdir(outdir) prefix = outdir + '/' + samplename - command = ['STAR', '--runMode', 'alignReads', '--runThreadN', nthreads, '--genomeLoad', 'NoSharedMemory', '--genomeDir', STARindex, '--readFilesIn', reads1, reads2, '--readFilesCommand', - 'zcat', '--outFileNamePrefix', prefix, '--outSAMtype', 'BAM', 'SortedByCoordinate', '--outSAMstrandField', 'intronMotif', '--outSAMmultNmax', '1', '--outSAMattributes', 'MD', 'NH'] + if reads2: + command = ['STAR', '--runMode', 'alignReads', '--runThreadN', nthreads, 
'--genomeLoad', 'NoSharedMemory', '--genomeDir', STARindex, '--readFilesIn', reads1, reads2, '--readFilesCommand', + 'zcat', '--outFileNamePrefix', prefix, '--outSAMtype', 'BAM', 'SortedByCoordinate', '--outSAMstrandField', 'intronMotif', '--outSAMmultNmax', '1', '--outSAMattributes', 'MD', 'NH'] + + elif not reads2: + command = ['STAR', '--runMode', 'alignReads', '--runThreadN', nthreads, '--genomeLoad', 'NoSharedMemory', '--genomeDir', STARindex, '--readFilesIn', reads1, '--readFilesCommand', + 'zcat', '--outFileNamePrefix', prefix, '--outSAMtype', 'BAM', 'SortedByCoordinate', '--outSAMstrandField', 'intronMotif', '--outSAMmultNmax', '1', '--outSAMattributes', 'MD', 'NH'] + print('Running STAR for {0}...'.format(samplename)) @@ -57,7 +63,7 @@ def runSTAR(reads1, reads2, nthreads, STARindex, samplename): print('Finished STAR for {0}!'.format(samplename)) -def bamtofastq(samplename, nthreads): +def bamtofastq(samplename, nthreads, reads2): #Given a bam file of uniquely aligned reads (produced from runSTAR), rederive these reads as fastq in preparation for submission to salmon #This function isn't needed anymore as we will align all reads. 
if not os.path.exists('STAR'): @@ -78,7 +84,10 @@ def bamtofastq(samplename, nthreads): r1file = samplename + '.unique.r1.fq.gz' r2file = samplename + '.unique.r2.fq.gz' print('Writing fastq file of uniquely aligned reads for {0}...'.format(samplename)) - command = ['samtools', 'fastq', '--threads', nthreads, '-1', r1file, '-2', r2file, '-0', '/dev/null', '-s', '/dev/null', '-n', sortedbam] + if reads2: + command = ['samtools', 'fastq', '--threads', nthreads, '-1', r1file, '-2', r2file, '-0', '/dev/null', '-s', '/dev/null', '-n', sortedbam] + elif not reads2: + command = ['samtools', 'fastq', '--threads', nthreads, '-0', r1file, '-n', sortedbam] subprocess.call(command) print('Done writing fastq files for {0}!'.format(samplename)) @@ -92,12 +101,17 @@ def runSalmon(reads1, reads2, nthreads, salmonindex, samplename): idx = os.path.abspath(salmonindex) r1 = os.path.abspath(reads1) - r2 = os.path.abspath(reads2) + if reads2: + r2 = os.path.abspath(reads2) os.chdir('salmon') - command = ['salmon', 'quant', '--libType', 'A', '-p', nthreads, '--seqBias', '--gcBias', + if reads2: + command = ['salmon', 'quant', '--libType', 'A', '-p', nthreads, '--seqBias', '--gcBias', '--validateMappings', '-1', r1, '-2', r2, '-o', samplename, '--index', idx, '--writeMappings={0}.salmon.bam'.format(samplename), '--writeQualities'] + elif not reads2: + command = ['salmon', 'quant', '--libType', 'A', '-p', nthreads, '--seqBias', + '--validateMappings', '-r', r1, '-o', samplename, '--index', idx, '--writeMappings={0}.salmon.bam'.format(samplename), '--writeQualities'] print('Running salmon for {0}...'.format(samplename)) @@ -109,6 +123,7 @@ def runSalmon(reads1, reads2, nthreads, salmonindex, samplename): movedquantfile = os.path.join(os.getcwd(), '{0}.quant.sf'.format(samplename)) os.rename(quantfile, movedquantfile) + print('Finished salmon for {0}!'.format(samplename)) def runPostmaster(samplename, nthreads): @@ -152,7 +167,7 @@ def addMD(samplename, reffasta, nthreads): if __name__ == 
'__main__': parser = argparse.ArgumentParser(description = 'Align and quantify reads using STAR, salmon, and postmaster in preparation for analysis with PIGPEN.') parser.add_argument('--forwardreads', type = str, help = 'Forward reads. Gzipped fastq.') - parser.add_argument('--reversereads', type = str, help = 'Reverse reads. Gzipped fastq.') + parser.add_argument('--reversereads', type = str, help = 'Reverse reads. Gzipped fastq. Do not supply if using single end reads.') parser.add_argument('--nthreads', type = str, help = 'Number of threads to use for alignment and quantification.') parser.add_argument('--STARindex', type = str, help = 'STAR index directory.') parser.add_argument('--salmonindex', type = str, help = 'Salmon index directory.') @@ -160,7 +175,10 @@ def addMD(samplename, reffasta, nthreads): args = parser.parse_args() r1 = os.path.abspath(args.forwardreads) - r2 = os.path.abspath(args.reversereads) + if args.reversereads: + r2 = os.path.abspath(args.reversereads) + elif not args.reversereads: + r2 = None STARindex = os.path.abspath(args.STARindex) salmonindex = os.path.abspath(args.salmonindex) samplename = args.samplename From 0128cddd37ac5d66e5e59e236e32f2be23ceb6a9 Mon Sep 17 00:00:00 2001 From: Matthew Taliaferro Date: Mon, 6 Mar 2023 09:17:17 -0700 Subject: [PATCH 053/108] small readme update --- README.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/README.md b/README.md index ff13429..4d1a170 100644 --- a/README.md +++ b/README.md @@ -142,7 +142,7 @@ PIGPEN can use G -> T conversions, G -> C conversions, or both when calculating ## Using one read of a paired end sample -The use of one read in a paired end sample for conversion quantification can be controlled using `--use_read1` and `--use_read2`. To use both reads, supply both flags. `--onlyConsiderOverlap` requires the use of both reads. Importantly, both reads are still used for genomic alignment and transcript quantification. 
+The use of one read in a paired end sample for conversion quantification can be controlled using `--use_read1` and `--use_read2`. To use both reads, supply both flags. `--onlyConsiderOverlap` requires the use of both reads. Importantly, both reads can still be used for genomic alignment and transcript quantification. ## Mask specific positions From 0a8eeecc4cefeb02d3354c191d651b1abbe1c338 Mon Sep 17 00:00:00 2001 From: Matthew Taliaferro Date: Tue, 7 Mar 2023 15:06:31 -0700 Subject: [PATCH 054/108] unique mappers only actually recognized in alignandquant --- alignAndQuant.py | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/alignAndQuant.py b/alignAndQuant.py index 793a73e..6e09b1a 100644 --- a/alignAndQuant.py +++ b/alignAndQuant.py @@ -36,11 +36,11 @@ def runSTAR(reads1, reads2, nthreads, STARindex, samplename): if reads2: command = ['STAR', '--runMode', 'alignReads', '--runThreadN', nthreads, '--genomeLoad', 'NoSharedMemory', '--genomeDir', STARindex, '--readFilesIn', reads1, reads2, '--readFilesCommand', - 'zcat', '--outFileNamePrefix', prefix, '--outSAMtype', 'BAM', 'SortedByCoordinate', '--outSAMstrandField', 'intronMotif', '--outFilterMultimapNmax', '1', '--outSAMattributes', 'MD', 'NH'] + 'zcat', '--outFileNamePrefix', prefix, '--outSAMtype', 'BAM', 'SortedByCoordinate', '--outSAMstrandField', 'intronMotif', '--outSAMattributes', 'MD', 'NH', '--outFilterMultimapNmax', '1'] elif not reads2: command = ['STAR', '--runMode', 'alignReads', '--runThreadN', nthreads, '--genomeLoad', 'NoSharedMemory', '--genomeDir', STARindex, '--readFilesIn', reads1, '--readFilesCommand', - 'zcat', '--outFileNamePrefix', prefix, '--outSAMtype', 'BAM', 'SortedByCoordinate', '--outSAMstrandField', 'intronMotif', '--outFilterMultimapNmax', '1', '--outSAMattributes', 'MD', 'NH'] + 'zcat', '--outFileNamePrefix', prefix, '--outSAMtype', 'BAM', 'SortedByCoordinate', '--outSAMstrandField', 'intronMotif', '--outSAMattributes', 'MD', 'NH', '--outFilterMultimapNmax', '1']
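Two similarly named STAR options appear in these scripts and are easy to confuse: `--outFilterMultimapNmax 1` discards any read mapping to more than one locus (unique mappers only), while `--outSAMmultNmax 1` keeps multimapping reads but reports only one alignment per read. A small hypothetical helper summarizing the choice:

```python
def star_multimap_args(allowmultimap):
    # --outSAMmultNmax 1: keep multimappers, but write a single alignment each
    # --outFilterMultimapNmax 1: only output reads that map to exactly one locus
    if allowmultimap:
        return ['--outSAMmultNmax', '1']
    return ['--outFilterMultimapNmax', '1']
```

Note that multimappers retained via `--outSAMmultNmax` can still be excluded downstream through pigpen.py's `--minMappingQual` filter, since STAR assigns unique alignments MAPQ 255.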
print('Running STAR for {0}...'.format(samplename)) From 5ad8859f566fc855d582dd6a367e88e9846518b1 Mon Sep 17 00:00:00 2001 From: Matthew Taliaferro Date: Wed, 8 Mar 2023 15:02:57 -0700 Subject: [PATCH 055/108] add pigpen support for single end reads --- getmismatches.py | 134 ++++++++++++++++++++++++++++++----------------- pigpen.py | 33 ++++++++---- 2 files changed, 110 insertions(+), 57 deletions(-) diff --git a/getmismatches.py b/getmismatches.py index 6594f11..a1428a6 100644 --- a/getmismatches.py +++ b/getmismatches.py @@ -30,7 +30,7 @@ def revcomp(nt): return nt_rc -def iteratereads_singleend(bam, snps = None): +def iteratereads_singleend(bam, use_g_t, use_g_c, nConv, minMappingQual, snps = None, maskpositions = None, verbosity = 'high'): #Read through a bam containing single end reads (or if it contains paired end reads, just use read 1) #Find nt conversion locations for each read. #Store the number of each conversion for each read in a dictionary. @@ -39,52 +39,70 @@ def iteratereads_singleend(bam, snps = None): readcounter = 0 convs = {} #{readid : dictionary of all conversions} + save = pysam.set_verbosity(0) with pysam.AlignmentFile(bam, 'r') as infh: - print('Finding nucleotide conversions in {0}...'.format(os.path.basename(bam))) + if verbosity == 'high': + print('Finding nucleotide conversions in {0}...'.format(os.path.basename(bam))) for read in infh.fetch(until_eof = True): - if read.is_secondary or read.is_supplementary or read.is_unmapped: + if read.is_secondary or read.is_supplementary or read.is_unmapped or read.mapping_quality < minMappingQual: continue - if read.is_read1: - readcounter +=1 - if readcounter % 10000000 == 0: + readcounter +=1 + if readcounter % 10000000 == 0: + if verbosity == 'high': print('Finding nucleotide conversions in read {0}...'.format(readcounter)) + + queryname = read.query_name + queryseq = read.query_sequence #this is always on the + strand, no matter what strand the read maps to + chrm = read.reference_name + 
qualities = list(read.query_qualities) - #Check mapping quality - #For nextgenmap, max mapq is 60 - if read.mapping_quality < 255: - continue - - queryname = read.query_name - queryseq = read.query_sequence #this is always on the + strand, no matter what strand the read maps to - chrom = read.reference_name - qualities = list(read.query_qualities) - - #Get a set of snp locations if we have them - if snps: - if chrm in snps: - snplocations = snps[chrm] #set of coordinates to mask - else: - snplocations = None + #Get a set of snp locations if we have them + if snps: + if chrm in snps: + snplocations = snps[chrm] #set of coordinates to mask else: snplocations = None + else: + snplocations = None - if read.is_reverse: - strand = '-' - elif not read.is_reverse: - strand = '+' + #Get a set of locations to mask if we have them + if maskpositions: + if chrm in maskpositions: + # set of coordinates to manually mask + masklocations = maskpositions[chrm] + else: + masklocations = None + else: + masklocations = None - alignedpairs = read.get_aligned_pairs(with_seq = True) - readqualities = list(read.query_qualities) - convs_in_read = getmismatches_singleend(alignedpairs, queryseq, readqualities, strand, chrom, snplocations) - - convs[queryname] = convs_in_read + #combine snps and manually masked positions into one set + #this combined set will be masklocations + if snplocations and masklocations: + masklocations.update(snplocations) + elif snplocations and not masklocations: + masklocations = snplocations + elif masklocations and not snplocations: + masklocations = masklocations + + if read.is_reverse: + strand = '-' + elif not read.is_reverse: + strand = '+' + alignedpairs = read.get_aligned_pairs(with_seq = True) + readqualities = list(read.query_qualities) + convs_in_read = getmismatches_singleend(alignedpairs, queryseq, readqualities, strand, masklocations, nConv, use_g_t, use_g_c) + + convs[queryname] = convs_in_read + + if verbosity == 'high': + print('Queried {0} read 
s in {1}.'.format(readcounter, os.path.basename(bam))) #Pickle and write convs? - return convs + return convs, readcounter -def getmismatches_singleend(alignedpairs, queryseq, readqualities, strand, chrom, snplocations): +def getmismatches_singleend(alignedpairs, queryseq, readqualities, strand, masklocations, nConv, use_g_t, use_g_c): #remove tuples that have None #These are either intronic or might have been soft-clipped #Tuples are (querypos, (0-based) refpos, refsequence) @@ -93,17 +111,30 @@ #refnt and querynt, as supplied by pysam, are always + strand #everything here is assuming read is on sense strand - #snplocations is a set of chrm_coord locations. At these locations, all queries will be treated as not having a conversion + #masklocations is a set of chrm_coord locations. At these locations, all queries will be treated as not having a conversion + + #remove positions where querypos is None + #these positions are not present in the read and therefore have no quality scores + alignedpairs = [x for x in alignedpairs if x[0] != None] + #Add quality scores to alignedpairs tuples, indexed by query position (x[0]) + #will now be (querypos, refpos, refsequence, qualityscore) + ap_withq = [] + for x in alignedpairs: + x += (readqualities[x[0]],) + ap_withq.append(x) + alignedpairs = ap_withq + + #Now remove positions where refsequence is None + #These may be places that got soft-clipped alignedpairs = [x for x in alignedpairs if None not in x] - #if we have snps, remove their locations from alignedpairs - #snplocations is a set of 0-based coordinates of snp locations to mask - if snplocations: - alignedpairs = [x for x in alignedpairs if x[1] not in snplocations] + #if we have locations to mask, remove their locations from alignedpairs + #masklocations is a set of 0-based coordinates of snp locations to mask + if masklocations: + alignedpairs = [x for x in alignedpairs if x[1] not in masklocations] convs = {}
#counts of conversions x_y where x is reference sequence and y is query sequence - convlocations = defaultdict(list) #{type of conv : [locations of conversion]} possibleconvs = [ 'a_a', 'a_t', 'a_c', 'a_g', 'a_n', @@ -137,7 +168,7 @@ def getmismatches_singleend(alignedpairs, queryseq, readqualities, strand, chrom #If the quality at this position passes threshold, record the conversion. #Otherwise, skip it. - qscore = readqualities[alignedpair[0]] + qscore = alignedpair[3] if qscore >= 30: #can change this later convs[conv] +=1 else: @@ -615,7 +646,7 @@ def split_bam(bam, nproc): return splitbams -def getmismatches(bam, onlyConsiderOverlap, snps, maskpositions, nConv, minMappingQual, nproc, use_g_t, use_g_c, use_read1, use_read2): +def getmismatches(datatype, bam, onlyConsiderOverlap, snps, maskpositions, nConv, minMappingQual, nproc, use_g_t, use_g_c, use_read1, use_read2): #Actually run the mismatch code (calling iteratereads_pairedend) #use multiprocessing #If there's only one processor, easier to use iteratereads_pairedend() directly. 
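The split/starmap/merge pattern that `getmismatches()` uses can be sketched independently of bam files. This illustration uses a thread-backed pool (`multiprocessing.dummy`) and toy data in place of the real per-read conversion counting; the worker below is a stand-in, not the actual `iteratereads_*` functions:

```python
from multiprocessing.dummy import Pool  # thread-backed drop-in for multiprocessing.Pool

def count_chunk(read_ids, use_g_t, use_g_c):
    # Stand-in for iteratereads_*() running on one split bam:
    # returns (convs, readcounter) just like the real workers.
    convs = {rid: {'g_t': 0, 'g_c': 0} for rid in read_ids}
    return convs, len(read_ids)

def merge_chunks(chunks, nproc=2):
    # Each worker returns a (convs, readcounter) tuple; merge the per-chunk
    # dicts and sum the counters, mirroring the loop over `results` above.
    with Pool(nproc) as pool:
        results = pool.starmap(count_chunk, [(c, True, True) for c in chunks])
    convs, total = {}, 0
    for chunk_convs, n in results:
        convs.update(chunk_convs)
        total += n
    return convs, total
```

Merging by `dict.update()` is safe here because read IDs are unique across the split bams, so chunks never collide on a key.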
@@ -625,12 +656,18 @@ splitbams = split_bam(bam, int(nproc)) argslist = [] for x in splitbams: - argslist.append((x, bool(onlyConsiderOverlap), bool( - use_g_t), bool(use_g_c), bool(use_read1), bool(use_read2), nConv, minMappingQual, snps, maskpositions, 'low')) + if datatype == 'paired': + argslist.append((x, bool(onlyConsiderOverlap), bool( + use_g_t), bool(use_g_c), bool(use_read1), bool(use_read2), nConv, minMappingQual, snps, maskpositions, 'low')) + elif datatype == 'single': + argslist.append((x, bool(use_g_t), bool(use_g_c), nConv, minMappingQual, snps, maskpositions, 'low')) #items returned from iteratereads_pairedend are in a list, one per process totalreadcounter = 0 #number of reads across all the split bams - results = pool.starmap(iteratereads_pairedend, argslist) #this actually returns two things, convs and readcounter + if datatype == 'paired': + results = pool.starmap(iteratereads_pairedend, argslist) #this actually returns two things, convs and readcounter + elif datatype == 'single': + results = pool.starmap(iteratereads_singleend, argslist) #results is a list of (convs, readcounter) tuples, one per process convs_split = [] for result in results: @@ -638,7 +675,10 @@ for result in results: totalreadcounter += result[1] - print('Queried {0} read pairs in {1}.'.format(totalreadcounter, os.path.basename(bam))) + if datatype == 'paired': + print('Queried {0} read pairs in {1}.'.format(totalreadcounter, os.path.basename(bam))) + elif datatype == 'single': + print('Queried {0} reads in {1}.'.format(totalreadcounter, os.path.basename(bam))) #Reorganize convs_split into convs as it is without multiprocessing convs = {} #{readid : dictionary of all conversions} @@ -656,7 +696,7 @@ def getmismatches(bam, onlyConsiderOverlap, snps, maskpositions,
nConv, minMappi if __name__ == '__main__': - convs, readcounter = iteratereads_pairedend(sys.argv[1], False, True, True, True, True, 1, None, None, 'high') - #summarize_convs(convs, sys.argv[2]) + convs, readcounter = iteratereads_singleend(sys.argv[1], True, True, 1, 255, None, None, 'high') + summarize_convs(convs, sys.argv[2]) diff --git a/pigpen.py b/pigpen.py index 7f865ec..0dad84a 100644 --- a/pigpen.py +++ b/pigpen.py @@ -14,6 +14,7 @@ if __name__ == '__main__': parser = argparse.ArgumentParser(description=' ,-,-----,\n PIGPEN **** \\ \\ ),)`-\'\n <`--\'> \\ \\` \n /. . `-----,\n OINC! > (\'\') , @~\n `-._, ___ /\n-|-|-|-|-|-|-|-| (( / (( / -|-|-| \n|-|-|-|-|-|-|-|- \'\'\' \'\'\' -|-|-|-\n-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|\n\n Pipeline for Identification \n Of Guanosine Positions\n Erroneously Notated', formatter_class = argparse.RawDescriptionHelpFormatter) + parser.add_argument('--datatype', type = str, choices = ['single', 'paired'], required = True, help = 'Single end or paired end data?') parser.add_argument('--samplenames', type = str, help = 'Comma separated list of samples to quantify.', required = True) parser.add_argument('--controlsamples', type = str, help = 'Comma separated list of control samples (i.e. those where no *induced* conversions are expected). May be a subset of samplenames. Required if SNPs are to be considered and a snpfile is not supplied.') parser.add_argument('--gff', type = str, help = 'Genome annotation in gff format.') @@ -25,17 +26,23 @@ parser.add_argument('--ROIbed', help = 'Optional. Bed file of specific regions of interest in which to quantify conversions. If supplied, only conversions in these regions will be quantified.', default = None) parser.add_argument('--SNPcoverage', type = int, help = 'Minimum coverage to call SNPs. Default = 20', default = 20) parser.add_argument('--SNPfreq', type = float, help = 'Minimum variant frequency to call SNPs. 
Default = 0.4', default = 0.4) - parser.add_argument('--onlyConsiderOverlap', action = 'store_true', help = 'Only consider conversions seen in both reads of a read pair?') + parser.add_argument('--onlyConsiderOverlap', action = 'store_true', help = 'Only consider conversions seen in both reads of a read pair? Only possible with paired end data.') parser.add_argument('--use_g_t', action = 'store_true', help = 'Consider G->T conversions?') parser.add_argument('--use_g_c', action = 'store_true', help = 'Consider G->C conversions?') - parser.add_argument('--use_read1', action = 'store_true', help = 'Use read1 when looking for conversions?') - parser.add_argument('--use_read2', action = 'store_true', help = 'Use read2 when looking for conversions?') + parser.add_argument('--use_read1', action = 'store_true', help = 'Use read1 when looking for conversions? Only useful with paired end data.') + parser.add_argument('--use_read2', action = 'store_true', help = 'Use read2 when looking for conversions? Only useful with paired end data.') parser.add_argument('--minMappingQual', type = int, help = 'Minimum mapping quality for a read to be considered in conversion counting. STAR unique mappers have MAPQ 255.', required = True) - parser.add_argument('--nConv', type = int, help = 'Minimum number of required G->T and/or G->C conversions in a read pair in order for conversions to be counted. Default is 1.', default = 1) + parser.add_argument('--nConv', type = int, help = 'Minimum number of required G->T and/or G->C conversions in a read pair in order for those conversions to be counted. Default is 1.', default = 1) parser.add_argument('--outputDir', type = str, help = 'Output directory.', required = True) - parser.add_argument('--dedupUMI', action = 'store_true', help = 'Deduplicate UMIs? requires UMI extract.') + parser.add_argument('--dedupUMI', action = 'store_true', help = 'Use deduplicated UMIs? 
Requires --dedupUMI to have been supplied to alignandquant.py.') args = parser.parse_args() + #If we have single end data, considering overlap of paired reads or only one read doesn't make sense + if args.datatype == 'single': + args.onlyConsiderOverlap = False + args.use_read1 = False + args.use_read2 = False + #Store command line arguments suppliedargs = {} for arg in vars(args): @@ -46,8 +53,8 @@ #Derive quant.sf, STAR bams, and postmaster bams samplenames = args.samplenames.split(',') salmonquants = [os.path.join(x, 'salmon', '{0}.quant.sf'.format(x)) for x in samplenames] - starbams = [os.path.join(x, 'STAR', '{0}Aligned.sortedByCoord.out.bam'.format(x)) for x in samplenames] - dedupbams = [os.path.join(x, 'STAR', '{0}.dedup.bam'.format(x)) for x in samplenames] + starbams = [os.path.join(x, 'STAR', '{0}Aligned.sortedByCoord.out.bam'.format(x)) for x in samplenames] #non-deduplicated bams + dedupbams = [os.path.join(x, 'STAR', '{0}.dedup.bam'.format(x)) for x in samplenames] #deduplicated bams postmasterbams = [os.path.join(x, 'postmaster', '{0}.postmaster.bam'.format(x)) for x in samplenames] #Take in list of control samples, make list of their corresponding star bams for SNP calling @@ -72,7 +79,7 @@ sys.exit() #We have to be using either read1 or read2 if not both - if not args.use_read1 and not args.use_read2: + if not args.use_read1 and not args.use_read2 and args.datatype == 'paired': print('We need to use read1 or read2, if not both! 
Add argument --use_read1 and/or --use_read2.') sys.exit() @@ -127,9 +134,15 @@ samplebam = samplebams[ind] sampleparams['samplebam'] = os.path.abspath(samplebam) if args.nproc == 1: - convs, readcounter = iteratereads_pairedend(samplebam, args.onlyConsiderOverlap, args.use_g_t, args.use_g_c, args.use_read1, args.use_read2, args.nConv, args.minMappingQual, snps, maskpositions, 'high') + if args.datatype == 'paired': + convs, readcounter = iteratereads_pairedend(samplebam, args.onlyConsiderOverlap, args.use_g_t, args.use_g_c, + args.use_read1, args.use_read2, args.nConv, args.minMappingQual, snps, maskpositions, 'high') + elif args.datatype == 'single': + convs, readcounter = iteratereads_singleend( + samplebam, args.use_g_t, args.use_g_c, args.nConv, args.minMappingQual, snps, maskpositions, 'high') elif args.nproc > 1: - convs = getmismatches(samplebam, args.onlyConsiderOverlap, snps, maskpositions, args.nConv, args.minMappingQual, args.nproc, args.use_g_t, args.use_g_c, args.use_read1, args.use_read2) + convs = getmismatches(args.datatype, samplebam, args.onlyConsiderOverlap, snps, maskpositions, args.nConv, + args.minMappingQual, args.nproc, args.use_g_t, args.use_g_c, args.use_read1, args.use_read2) print('Getting posterior probabilities from salmon alignment file...') postmasterbam = postmasterbams[ind] From 58c0a4d80fe90ffdcabc9c385eb5418c3dcec4ec Mon Sep 17 00:00:00 2001 From: Matthew Taliaferro Date: Thu, 13 Apr 2023 08:45:13 -0600 Subject: [PATCH 056/108] minor updates for UMI extraction and quant --- ExtractUMI.py | 4 ++-- alignUMIquant.py | 7 +++++-- 2 files changed, 7 insertions(+), 4 deletions(-) diff --git a/ExtractUMI.py b/ExtractUMI.py index 5ed59a3..d21248f 100644 --- a/ExtractUMI.py +++ b/ExtractUMI.py @@ -45,13 +45,13 @@ def runExtract(r1, r2, samplename, lib_type): parser = argparse.ArgumentParser(description = 'Extract UMIs using umi-tools in preparation for analysis with AlignUMIquant.') parser.add_argument('--forwardreads', type = str, help =
'Forward reads. Gzipped fastq.', required = True) parser.add_argument('--reversereads', type = str, help = 'Reverse reads. Gzipped fastq.', required = True) - parser.add_argument('--samplenames', type = str, help = 'Sample name. Will be appended to output files.', required = True) + parser.add_argument('--samplename', type = str, help = 'Sample name. Will be appended to output files.', required = True) parser.add_argument('--lib_type', type = str, help = 'Library type. Either "LEXO" or "SA"', required = True) args = parser.parse_args() r1 = os.path.abspath(args.forwardreads) r2 = os.path.abspath(args.reversereads) - samplename = args.samplenames + samplename = args.samplename lib_type = args.lib_type if args.lib_type not in ["LEXO", "SA"]: diff --git a/alignUMIquant.py b/alignUMIquant.py index 1289164..feafb2f 100644 --- a/alignUMIquant.py +++ b/alignUMIquant.py @@ -9,9 +9,12 @@ This bam will be deduplicated with UMI-tools then passed to salmon(for read assignment). It will then run postmaster to append transcript assignments to the salmon-produced bam. -This is going to take in gzipped fastqs with UMIs extractsd, +This is going to take in gzipped fastqs with UMIs extracted, a directory containing the STAR index for this genome, and a directory containing the salmon index for this genome. +This means that, in addition to any adapter trimming, ***reads must have been first processed with umi_tools extract***. +For quantseq libraries, this corresponds to the first 6 nt of read 1. + Reads are aligned to the genome using STAR. This bam file will be used for mutation calling. In this alignment, we allow multiple mapping reads, but only report the best alignment. This bam will then be deduplicated based on UMI and alignment position. 
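Positional UMI deduplication, as performed here by `umi_tools dedup`, can be illustrated with a deliberately simplified sketch: keep one read per (chromosome, position, strand, UMI) key. The record layout below is hypothetical, and real `umi_tools` additionally clusters UMIs that differ by likely sequencing errors before collapsing:

```python
def dedup_by_umi(reads):
    # Keep the first read seen for each (chrom, pos, strand, umi) combination;
    # later reads sharing that key are treated as PCR duplicates and dropped.
    seen = set()
    kept = []
    for read in reads:
        key = (read['chrom'], read['pos'], read['strand'], read['umi'])
        if key not in seen:
            seen.add(key)
            kept.append(read)
    return kept

reads = [
    {'chrom': 'chr1', 'pos': 100, 'strand': '+', 'umi': 'ACGTCA'},
    {'chrom': 'chr1', 'pos': 100, 'strand': '+', 'umi': 'ACGTCA'},  # duplicate
    {'chrom': 'chr1', 'pos': 100, 'strand': '+', 'umi': 'TTGGCA'},  # distinct UMI
]
```

The 6 nt UMIs match the quantseq layout described above, where the UMI is the first 6 nt of read 1 and must be moved into the read name by `umi_tools extract` before alignment.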
@@ -76,7 +79,7 @@ def runDedup(samplename, nthreads): subprocess.run(command) #We don't need the STAR alignment file anymore, and it's pretty big - # os.remove(STARbam) + os.remove(STARbam) print('Finished deduplicating {0}!'.format(samplename)) From 9b32deb480725cc84446469bb47c276db7825e7a Mon Sep 17 00:00:00 2001 From: Matthew Taliaferro Date: Thu, 13 Apr 2023 09:44:21 -0600 Subject: [PATCH 057/108] add datatype param to use with ROIbed --- pigpen.py | 11 ++++++++--- 1 file changed, 8 insertions(+), 3 deletions(-) diff --git a/pigpen.py b/pigpen.py index 0dad84a..e939eea 100644 --- a/pigpen.py +++ b/pigpen.py @@ -185,10 +185,15 @@ samplebam = samplebams[ind] sampleparams['samplebam'] = os.path.abspath(samplebam) if args.nproc == 1: - convs, readcounter = iteratereads_pairedend( - samplebam, args.onlyConsiderOverlap, args.use_g_t, args.use_g_c, args.use_read1, args.use_read2, args.nConv, args.minMappingQual, snps, maskpositions, 'high') + if args.datatype == 'paired': + convs, readcounter = iteratereads_pairedend(samplebam, args.onlyConsiderOverlap, args.use_g_t, args.use_g_c, + args.use_read1, args.use_read2, args.nConv, args.minMappingQual, snps, maskpositions, 'high') + elif args.datatype == 'single': + convs, readcounter = iteratereads_singleend( + samplebam, args.use_g_t, args.use_g_c, args.nConv, args.minMappingQual, snps, maskpositions, 'high') + elif args.nproc > 1: - convs = getmismatches(samplebam, args.onlyConsiderOverlap, snps, maskpositions, + convs = getmismatches(args.datatype, samplebam, args.onlyConsiderOverlap, snps, maskpositions, args.nConv, args.minMappingQual, args.nproc, args.use_g_t, args.use_g_c, args.use_read1, args.use_read2) print('Assigning reads to genes in supplied bed file...') From 7de6de11f99a4644b8214e94321fd7821c6819fb Mon Sep 17 00:00:00 2001 From: Matthew Taliaferro Date: Thu, 27 Jul 2023 10:37:26 -0600 Subject: [PATCH 058/108] optionally allow multimappers with alignAndQuant --- alignAndQuant.py | 62
+++++++++++++++++++++++++++++++----------------- 1 file changed, 40 insertions(+), 22 deletions(-) diff --git a/alignAndQuant.py b/alignAndQuant.py index 6e09b1a..7760a4d 100644 --- a/alignAndQuant.py +++ b/alignAndQuant.py @@ -18,9 +18,14 @@ #the salmon output is .quant.sf and .salmon.bam in salmon/, #and the postmaster output is .postmaster.bam in postmaster/ +#If --allowmultimap is provided, then all reads are given to salmon for quantification. If it is not provided, +#then only uniquely aligning reads are written to the STAR alignment and later provided to salmon for quantification. + +#Keep in mind that pigpen.py has a minimum quality score filter that can also be utilized later for filtering multimapping reads. + #Requires STAR, salmon (>= 1.9.0), and postmaster be in user's PATH. -def runSTAR(reads1, reads2, nthreads, STARindex, samplename): +def runSTAR(reads1, reads2, nthreads, STARindex, samplename, allowmultimap): if not os.path.exists('STAR'): os.mkdir('STAR') @@ -34,13 +39,23 @@ def runSTAR(reads1, reads2, nthreads, STARindex, samplename): os.mkdir(outdir) prefix = outdir + '/' + samplename - if reads2: - command = ['STAR', '--runMode', 'alignReads', '--runThreadN', nthreads, '--genomeLoad', 'NoSharedMemory', '--genomeDir', STARindex, '--readFilesIn', reads1, reads2, '--readFilesCommand', - 'zcat', '--outFileNamePrefix', prefix, '--outSAMtype', 'BAM', 'SortedByCoordinate', '--outSAMstrandField', 'intronMotif', '--outSAMattributes', 'MD', 'NH', '--outFilterMultimapNmax', '1'] + if not allowmultimap: + if reads2: + command = ['STAR', '--runMode', 'alignReads', '--runThreadN', nthreads, '--genomeLoad', 'NoSharedMemory', '--genomeDir', STARindex, '--readFilesIn', reads1, reads2, '--readFilesCommand', + 'zcat', '--outFileNamePrefix', prefix, '--outSAMtype', 'BAM', 'SortedByCoordinate', '--outSAMstrandField', 'intronMotif', '--outSAMattributes', 'MD', 'NH', '--outFilterMultimapNmax', '1'] - elif not reads2: - command = ['STAR', '--runMode', 'alignReads', 
'--runThreadN', nthreads, '--genomeLoad', 'NoSharedMemory', '--genomeDir', STARindex, '--readFilesIn', reads1, '--readFilesCommand', - 'zcat', '--outFileNamePrefix', prefix, '--outSAMtype', 'BAM', 'SortedByCoordinate', '--outSAMstrandField', 'intronMotif', '--outSAMattributes', 'MD', 'NH', '--outFilterMultimapNmax', '1'] + elif not reads2: + command = ['STAR', '--runMode', 'alignReads', '--runThreadN', nthreads, '--genomeLoad', 'NoSharedMemory', '--genomeDir', STARindex, '--readFilesIn', reads1, '--readFilesCommand', - 'zcat', '--outFileNamePrefix', prefix, '--outSAMtype', 'BAM', 'SortedByCoordinate', '--outSAMstrandField', 'intronMotif', '--outSAMattributes', 'MD', 'NH', '--outFilterMultimapNmax', '1'] + + elif allowmultimap: + if reads2: + command = ['STAR', '--runMode', 'alignReads', '--runThreadN', nthreads, '--genomeLoad', 'NoSharedMemory', '--genomeDir', STARindex, '--readFilesIn', reads1, reads2, '--readFilesCommand', + 'zcat', '--outFileNamePrefix', prefix, '--outSAMtype', 'BAM', 'SortedByCoordinate', '--outSAMstrandField', 'intronMotif', '--outSAMmultNmax', '1', '--outSAMattributes', 'MD', 'NH'] + + elif not reads2: + command = ['STAR', '--runMode', 'alignReads', '--runThreadN', nthreads, '--genomeLoad', 'NoSharedMemory', '--genomeDir', STARindex, '--readFilesIn', reads1, '--readFilesCommand', + 'zcat', '--outFileNamePrefix', prefix, '--outSAMtype', 'BAM', 'SortedByCoordinate', '--outSAMstrandField', 'intronMotif', '--outSAMmultNmax', '1', '--outSAMattributes', 'MD', 'NH'] print('Running STAR for {0}...'.format(samplename)) @@ -117,11 +132,6 @@ def runSalmon(reads1, reads2, nthreads, salmonindex, samplename): movedquantfile = os.path.join(os.getcwd(), '{0}.quant.sf'.format(samplename)) os.rename(quantfile, movedquantfile) - #Remove uniquely aligning read files - os.remove(r1) - if reads2: - os.remove(r2) - print('Finished salmon for {0}!'.format(samplename)) def runPostmaster(samplename, nthreads): @@ -170,6 +180,7 @@ def addMD(samplename, reffasta, 
nthreads): parser.add_argument('--STARindex', type = str, help = 'STAR index directory.') parser.add_argument('--salmonindex', type = str, help = 'Salmon index directory.') parser.add_argument('--samplename', type = str, help = 'Sample name. Will be appended to output files.') + parser.add_argument('--allowmultimap', action = 'store_true', help = 'Consider multimapping reads in alignments and quantifications?' ) args = parser.parse_args() r1 = os.path.abspath(args.forwardreads) @@ -181,6 +192,7 @@ def addMD(samplename, reffasta, nthreads): salmonindex = os.path.abspath(args.salmonindex) samplename = args.samplename nthreads = args.nthreads + allowmultimap = args.allowmultimap wd = os.path.abspath(os.getcwd()) sampledir = os.path.join(wd, samplename) @@ -189,16 +201,22 @@ def addMD(samplename, reffasta, nthreads): os.mkdir(sampledir) os.chdir(sampledir) - #uniquely aligning read files - uniquer1 = samplename + '.unique.r1.fq.gz' - if args.reversereads: - uniquer2 = samplename + '.unique.r2.fq.gz' - elif not args.reversereads: - uniquer2 = None - - runSTAR(r1, r2, nthreads, STARindex, samplename) - bamtofastq(samplename, nthreads, r2) - runSalmon(uniquer1, uniquer2, nthreads, salmonindex, samplename) + runSTAR(r1, r2, nthreads, STARindex, samplename, allowmultimap) + if not allowmultimap: + #uniquely aligning read files + uniquer1 = samplename + '.unique.r1.fq.gz' + if args.reversereads: + uniquer2 = samplename + '.unique.r2.fq.gz' + elif not args.reversereads: + uniquer2 = None + bamtofastq(samplename, nthreads, r2) + runSalmon(uniquer1, uniquer2, nthreads, salmonindex, samplename) + #Remove uniquely aligning read files + os.remove(uniquer1) + if args.reversereads: + os.remove(uniquer2) + elif allowmultimap: + runSalmon(r1, r2, nthreads, salmonindex, samplename) os.chdir(sampledir) runPostmaster(samplename, nthreads) From 618ced1cab3e166b1d6d20aa5894edf0e6cf5aa8 Mon Sep 17 00:00:00 2001 From: Matthew Taliaferro Date: Thu, 27 Jul 2023 13:33:14 -0600 Subject: [PATCH 
059/108] update readme image --- README.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/README.md b/README.md index 4d1a170..c72dc69 100644 --- a/README.md +++ b/README.md @@ -26,7 +26,7 @@ Uniquely aligned reads are then extracted and used to quantify transcript abunda Following the creation of alignment files produced by `STAR` and `postmaster` as well as transcript quantifications produced by `salmon`, these files are then used by `pigpen.py` to identify nucleotide conversions, assign them to transcripts and genes, and then quantify the number of conversions in each gene. A graphical overview of the flow of `PIGPEN` is shown below. -![alt text](https://images.squarespace-cdn.com/content/v1/591d9c8cbebafbf01b1e28f9/f4a15b89-b3f1-4a10-84fc-5e669594f4e4/updatedPIGPENscheme.png?format=1500w "PIGPEN overview") +![alt text](https://images.squarespace-cdn.com/content/v1/591d9c8cbebafbf01b1e28f9/77d2062a-a31e-41b9-90ad-5963f618c6a6/updatedPIGPENscheme.png?format=1000w "PIGPEN overview") ## Requirements From 7718087742c095445ce362551f12cabab95ce6fd Mon Sep 17 00:00:00 2001 From: Matthew Taliaferro Date: Wed, 1 Nov 2023 14:35:23 -0600 Subject: [PATCH 060/108] change required tx overlap length in assignreads.py --- assignreads.py | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/assignreads.py b/assignreads.py index b399804..d789828 100644 --- a/assignreads.py +++ b/assignreads.py @@ -54,7 +54,7 @@ def processOverlaps(overlaps, numpairs): txs = overlaps[read] maxtx = max(txs, key = txs.get) overlaplength = txs[maxtx] #can implement minimum overlap here - if overlaplength >= 225: + if overlaplength >= 80: gene = maxtx.split('_')[0] read2gene[read] = gene From bc7a2ae59e50cfec5fc2873ce40cd0de92c1e2c9 Mon Sep 17 00:00:00 2001 From: Matthew Taliaferro Date: Wed, 1 Nov 2023 14:37:45 -0600 Subject: [PATCH 061/108] add maxmap parameter to alignAndQuant.py --- alignAndQuant.py | 105 +++++++++++++++++++++++++++-------------------- 1 file 
changed, 61 insertions(+), 44 deletions(-) diff --git a/alignAndQuant.py b/alignAndQuant.py index 7760a4d..8320c4b 100644 --- a/alignAndQuant.py +++ b/alignAndQuant.py @@ -3,6 +3,7 @@ import sys import shutil import argparse +import pysam #Given a pair of read files, align reads using STAR and quantify/align reads using salmon. #This will make a STAR-produced bam (for pigpen mutation calling) and a salmon-produced bam (for read assignment). @@ -18,14 +19,11 @@ #the salmon output is .quant.sf and .salmon.bam in salmon/, #and the postmaster output is .postmaster.bam in postmaster/ -#If --allowmultimap is provided, then all reads are given to salmon for quantification. If it is not provided, -#then only uniquely aligning reads are written to the STAR alignment and later provided to salmon for quantification. - -#Keep in mind that pigpen.py has a minimum quality score filter that can also be utilized later for filtering multimapping reads. +#maxmap parameter can be used for filtering multimapping reads #Requires STAR, salmon(>= 1.9.0), and postmaster be in user's PATH. 
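The `maxmap` filtering described above keys off STAR's NH tag, which records how many alignments were reported for each read. Stripped of the bam I/O, the decision logic looks roughly like this (plain dicts stand in for `pysam` alignment records; the field names are invented for the sketch):

```python
def filter_by_nh(reads, maxmap):
    # Keep only reads whose NH tag (number of reported alignments)
    # is <= maxmap, and report what percentage survived.
    kept = [r for r in reads if r['NH'] <= maxmap]
    filteredpct = round((len(kept) / len(reads)) * 100, 3)
    return kept, filteredpct

reads = [{'name': 'a', 'NH': 1}, {'name': 'b', 'NH': 4}, {'name': 'c', 'NH': 1}]
kept, pct = filter_by_nh(reads, maxmap=1)
# kept contains reads 'a' and 'c'; pct is 66.667
```

Setting `maxmap` to 1 reproduces a uniquely-aligned-only filter; larger values admit controlled amounts of multimapping.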
-def runSTAR(reads1, reads2, nthreads, STARindex, samplename, allowmultimap): +def runSTAR(reads1, reads2, nthreads, STARindex, samplename): if not os.path.exists('STAR'): os.mkdir('STAR') @@ -39,23 +37,14 @@ def runSTAR(reads1, reads2, nthreads, STARindex, samplename, allowmultimap): os.mkdir(outdir) prefix = outdir + '/' + samplename - if not allowmultimap: - if reads2: - command = ['STAR', '--runMode', 'alignReads', '--runThreadN', nthreads, '--genomeLoad', 'NoSharedMemory', '--genomeDir', STARindex, '--readFilesIn', reads1, reads2, '--readFilesCommand', - 'zcat', '--outFileNamePrefix', prefix, '--outSAMtype', 'BAM', 'SortedByCoordinate', '--outSAMstrandField', 'intronMotif', '--outSAMattributes', 'MD', 'NH', '-–outFilterMultimapNmax', '1'] - - elif not reads2: - command = ['STAR', '--runMode', 'alignReads', '--runThreadN', nthreads, '--genomeLoad', 'NoSharedMemory', '--genomeDir', STARindex, '--readFilesIn', reads1, '--readFilesCommand', - 'zcat', '--outFileNamePrefix', prefix, '--outSAMtype', 'BAM', 'SortedByCoordinate', '--outSAMstrandField', 'intronMotif', '--outSAMattributes', 'MD', 'NH', '-–outFilterMultimapNmax', '1'] - elif allowmultimap: - if reads2: - command = ['STAR', '--runMode', 'alignReads', '--runThreadN', nthreads, '--genomeLoad', 'NoSharedMemory', '--genomeDir', STARindex, '--readFilesIn', reads1, reads2, '--readFilesCommand', - 'zcat', '--outFileNamePrefix', prefix, '--outSAMtype', 'BAM', 'SortedByCoordinate', '--outSAMstrandField', 'intronMotif', '--outSAMmultNmax', '1', '--outSAMattributes', 'MD', 'NH'] + if reads2: + command = ['STAR', '--runMode', 'alignReads', '--runThreadN', nthreads, '--genomeLoad', 'NoSharedMemory', '--genomeDir', STARindex, '--readFilesIn', reads1, reads2, '--readFilesCommand', + 'zcat', '--outFileNamePrefix', prefix, '--outSAMtype', 'BAM', 'SortedByCoordinate', '--outSAMstrandField', 'intronMotif', '--outSAMmultNmax', '1', '--outSAMattributes', 'MD', 'NH'] - elif not reads2: - command = ['STAR', '--runMode', 
'alignReads', '--runThreadN', nthreads, '--genomeLoad', 'NoSharedMemory', '--genomeDir', STARindex, '--readFilesIn', reads1, '--readFilesCommand', - 'zcat', '--outFileNamePrefix', prefix, '--outSAMtype', 'BAM', 'SortedByCoordinate', '--outSAMstrandField', 'intronMotif', '--outSAMmultNmax', '1', '--outSAMattributes', 'MD', 'NH'] + elif not reads2: + command = ['STAR', '--runMode', 'alignReads', '--runThreadN', nthreads, '--genomeLoad', 'NoSharedMemory', '--genomeDir', STARindex, '--readFilesIn', reads1, '--readFilesCommand', + 'zcat', '--outFileNamePrefix', prefix, '--outSAMtype', 'BAM', 'SortedByCoordinate', '--outSAMstrandField', 'intronMotif', '--outSAMmultNmax', '1', '--outSAMattributes', 'MD', 'NH'] print('Running STAR for {0}...'.format(samplename)) @@ -71,15 +60,43 @@ def runSTAR(reads1, reads2, nthreads, STARindex, samplename, allowmultimap): print('Finished STAR for {0}!'.format(samplename)) +def filterbam(samplename, maxmap): + #Take a bam and filter it, only keeping reads that map to <= maxmap locations using NH:i tag + #For some reason whether STAR uses --outFilterMultiMapNmax is unpredictably variable, so we will do it this way. + cwd = os.getcwd() + outdir = os.path.join(cwd, 'STAR') + maxmap = int(maxmap) + inbam = os.path.join(outdir, samplename + 'Aligned.sortedByCoord.out.bam') + outbam = os.path.join(outdir, samplename + 'Aligned.sortedByCoord.multifiltered.out.bam') + + print('Removing reads with > {0} alignments...'.format(maxmap)) + + with pysam.AlignmentFile(inbam, 'rb') as infh, pysam.AlignmentFile(outbam, 'wb', template = infh) as outfh: + readcount = 0 + filteredreadcount = 0 + for read in infh.fetch(until_eof = True): + readcount +=1 + nh = read.get_tag('NH') + if nh <= maxmap: + filteredreadcount +=1 + outfh.write(read) + + #Remove unfiltered bam + os.remove(inbam) + + filteredpct = round((filteredreadcount / readcount) * 100, 3) + + print('Looked through {0} reads. 
{1} ({2}%) had {3} or fewer alignments.'.format(readcount, filteredreadcount, filteredpct, maxmap)) + def bamtofastq(samplename, nthreads, reads2): - #Given a bam file of uniquely aligned reads (produced from runSTAR), rederive these reads as fastq in preparation for submission to salmon + #Given a bam file of aligned reads (produced from runSTAR), rederive these reads as fastq in preparation for submission to salmon if not os.path.exists('STAR'): os.mkdir('STAR') cwd = os.getcwd() outdir = os.path.join(cwd, 'STAR') - inbam = os.path.join(outdir, samplename + 'Aligned.sortedByCoord.out.bam') + inbam = os.path.join(outdir, samplename + 'Aligned.sortedByCoord.multifiltered.out.bam') sortedbam = os.path.join(outdir, 'temp.namesort.bam') #First sort bam file by readname @@ -89,9 +106,9 @@ def bamtofastq(samplename, nthreads, reads2): print('Done!') #Now derive fastq - r1file = samplename + '.unique.r1.fq.gz' - r2file = samplename + '.unique.r2.fq.gz' - print('Writing fastq file of uniquely aligned reads for {0}...'.format(samplename)) + r1file = samplename + '.aligned.r1.fq.gz' + r2file = samplename + '.aligned.r2.fq.gz' + print('Writing fastq file of aligned reads for {0}...'.format(samplename)) if reads2: command = ['samtools', 'fastq', '--threads', nthreads, '-1', r1file, '-2', r2file, '-0', '/dev/null', '-s', '/dev/null', '-n', sortedbam] elif not reads2: @@ -103,7 +120,7 @@ def bamtofastq(samplename, nthreads, reads2): def runSalmon(reads1, reads2, nthreads, salmonindex, samplename): - #Take in those uniquely aligning reads and quantify transcript abundance with them using salmon. + #Take in those aligning reads and quantify transcript abundance with them using salmon. 
if not os.path.exists('salmon'): os.mkdir('salmon') @@ -180,7 +197,7 @@ def addMD(samplename, reffasta, nthreads): parser.add_argument('--STARindex', type = str, help = 'STAR index directory.') parser.add_argument('--salmonindex', type = str, help = 'Salmon index directory.') parser.add_argument('--samplename', type = str, help = 'Sample name. Will be appended to output files.') - parser.add_argument('--allowmultimap', action = 'store_true', help = 'Consider multimapping reads in alignments and quantifications?' ) + parser.add_argument('--maxmap', type = int, help = 'Maximum number of allowable alignments for a read.') args = parser.parse_args() r1 = os.path.abspath(args.forwardreads) @@ -192,7 +209,7 @@ def addMD(samplename, reffasta, nthreads): salmonindex = os.path.abspath(args.salmonindex) samplename = args.samplename nthreads = args.nthreads - allowmultimap = args.allowmultimap + maxmap = args.maxmap wd = os.path.abspath(os.getcwd()) sampledir = os.path.join(wd, samplename) @@ -201,22 +218,22 @@ def addMD(samplename, reffasta, nthreads): os.mkdir(sampledir) os.chdir(sampledir) - runSTAR(r1, r2, nthreads, STARindex, samplename, allowmultimap) - if not allowmultimap: - #uniquely aligning read files - uniquer1 = samplename + '.unique.r1.fq.gz' - if args.reversereads: - uniquer2 = samplename + '.unique.r2.fq.gz' - elif not args.reversereads: - uniquer2 = None - bamtofastq(samplename, nthreads, r2) - runSalmon(uniquer1, uniquer2, nthreads, salmonindex, samplename) - #Remove uniquely aligning read files - os.remove(uniquer1) - if args.reversereads: - os.remove(uniquer2) - elif allowmultimap: - runSalmon(r1, r2, nthreads, salmonindex, samplename) + runSTAR(r1, r2, nthreads, STARindex, samplename) + filterbam(samplename, maxmap) + + #aligned read files + alignedr1 = samplename + '.aligned.r1.fq.gz' + if args.reversereads: + alignedr2 = samplename + '.aligned.r2.fq.gz' + elif not args.reversereads: + alignedr2 = None + bamtofastq(samplename, nthreads, r2) + 
runSalmon(alignedr1, alignedr2, nthreads, salmonindex, samplename) + #Remove aligned fastqs + os.chdir(sampledir) + os.remove(alignedr1) + if args.reversereads: + os.remove(alignedr2) os.chdir(sampledir) runPostmaster(samplename, nthreads) From c23432db746de4fd945fb9fa2a6a3da34b9d3184 Mon Sep 17 00:00:00 2001 From: Matthew Taliaferro Date: Wed, 6 Dec 2023 14:00:55 -0700 Subject: [PATCH 062/108] removed necessity of dedup UMI in pigpen.py --- pigpen.py | 14 ++++---------- 1 file changed, 4 insertions(+), 10 deletions(-) diff --git a/pigpen.py b/pigpen.py index e939eea..9382311 100644 --- a/pigpen.py +++ b/pigpen.py @@ -9,6 +9,7 @@ from maskpositions import readmaskbed from getmismatches import iteratereads_pairedend, getmismatches from assignreads_salmon import getpostmasterassignments, assigntotxs, collapsetogene, readspergene, writeOutput +#from assignreads_salmon_ensembl import getpostmasterassignments, assigntotxs, collapsetogene, readspergene, writeOutput from assignreads import getReadOverlaps, processOverlaps from conversionsPerGene import getPerGene, writeConvsPerGene @@ -34,7 +35,6 @@ parser.add_argument('--minMappingQual', type = int, help = 'Minimum mapping quality for a read to be considered in conversion counting. STAR unique mappers have MAPQ 255.', required = True) parser.add_argument('--nConv', type = int, help = 'Minimum number of required G->T and/or G->C conversions in a read pair in order for those conversions to be counted. Default is 1.', default = 1) parser.add_argument('--outputDir', type = str, help = 'Output directory.', required = True) - parser.add_argument('--dedupUMI', action = 'store_true', help = 'Use deduplicated UMIs? 
Requires --dedupUMI to have been supplied to alignandquant.py.') args = parser.parse_args() #If we have single end data, considering overlap of paired reads or only one read doesn't make sense @@ -54,7 +54,6 @@ samplenames = args.samplenames.split(',') salmonquants = [os.path.join(x, 'salmon', '{0}.quant.sf'.format(x)) for x in samplenames] starbams = [os.path.join(x, 'STAR', '{0}Aligned.sortedByCoord.out.bam'.format(x)) for x in samplenames] #non-deduplicated bams - dedupbams = [os.path.join(x, 'STAR', '{0}.dedup.bam'.format(x)) for x in samplenames] #deduplicated bams postmasterbams = [os.path.join(x, 'postmaster', '{0}.postmaster.bam'.format(x)) for x in samplenames] #Take in list of control samples, make list of their corresponding star bams for SNP calling @@ -64,14 +63,9 @@ samplebams = [] controlsamplebams = [] for ind, x in enumerate(samplenames): - if args.dedupUMI: - samplebams.append(dedupbams[ind]) - if args.controlsamples and x in controlsamples: - controlsamplebams.append(dedupbams[ind]) - else: - samplebams.append(starbams[ind]) - if args.controlsamples and x in controlsamples: - controlsamplebams.append(starbams[ind]) + samplebams.append(starbams[ind]) + if args.controlsamples and x in controlsamples: + controlsamplebams.append(starbams[ind]) #We have to be either looking for G->T or G->C, if not both if not args.use_g_t and not args.use_g_c: From 8d0026d61dfad680b65e651e63bf3151a5c94f44 Mon Sep 17 00:00:00 2001 From: Matthew Taliaferro Date: Wed, 6 Dec 2023 14:02:43 -0700 Subject: [PATCH 063/108] alignandquant unfiltered bam is no longer kept --- alignAndQuant.py | 13 +++++++++++-- 1 file changed, 11 insertions(+), 2 deletions(-) diff --git a/alignAndQuant.py b/alignAndQuant.py index 8320c4b..5636dd6 100644 --- a/alignAndQuant.py +++ b/alignAndQuant.py @@ -81,8 +81,17 @@ def filterbam(samplename, maxmap): filteredreadcount +=1 outfh.write(read) - #Remove unfiltered bam + #Remove unfiltered bam and its index os.remove(inbam) + os.remove(inbam + 
'.bai') + #Rename filtered bam so that it has the same name as the original + #This helps later when pigpen is looking for bams with certain expected names + os.rename(outbam, inbam) + #index filtered bam + bamindex = inbam + '.bai' + indexCMD = 'samtools index ' + inbam + index = subprocess.Popen(indexCMD, shell=True) + index.wait() filteredpct = round((filteredreadcount / readcount) * 100, 3) @@ -96,7 +105,7 @@ def bamtofastq(samplename, nthreads, reads2): cwd = os.getcwd() outdir = os.path.join(cwd, 'STAR') - inbam = os.path.join(outdir, samplename + 'Aligned.sortedByCoord.multifiltered.out.bam') + inbam = os.path.join(outdir, samplename + 'Aligned.sortedByCoord.out.bam') sortedbam = os.path.join(outdir, 'temp.namesort.bam') #First sort bam file by readname From 77af14d98a79cae939b028336206cb5205d8cb09 Mon Sep 17 00:00:00 2001 From: Matthew Taliaferro Date: Wed, 6 Dec 2023 14:03:30 -0700 Subject: [PATCH 064/108] alignUMIquant allow filtering by num of alignments --- alignUMIquant.py | 113 +++++++++++++++++++++++++++++++++++------------ 1 file changed, 84 insertions(+), 29 deletions(-) diff --git a/alignUMIquant.py b/alignUMIquant.py index feafb2f..cc0108c 100644 --- a/alignUMIquant.py +++ b/alignUMIquant.py @@ -3,6 +3,8 @@ import sys import shutil import argparse +import pysam + ''' Given a pair of read files, align reads using STAR, deduplicate reads by UMI, and quantify reads using salmon. This will make a STAR-produced bam (for pigpen mutation calling) @@ -62,6 +64,46 @@ def runSTAR(reads1, reads2, nthreads, STARindex, samplename): print('Finished STAR for {0}!'.format(samplename)) +def filterbam(samplename, maxmap): + #Take a bam and filter it, only keeping reads that map to <= maxmap locations using NH:i tag + #For some reason whether STAR uses --outFilterMultimapNmax is unpredictably variable, so we will do it this way. 
+ cwd = os.getcwd() + outdir = os.path.join(cwd, 'STAR') + maxmap = int(maxmap) + inbam = os.path.join(outdir, samplename + 'Aligned.sortedByCoord.out.bam') + outbam = os.path.join(outdir, samplename + + 'Aligned.sortedByCoord.multifiltered.out.bam') + + print('Removing reads with > {0} alignments...'.format(maxmap)) + + with pysam.AlignmentFile(inbam, 'rb') as infh, pysam.AlignmentFile(outbam, 'wb', template=infh) as outfh: + readcount = 0 + filteredreadcount = 0 + for read in infh.fetch(until_eof=True): + readcount += 1 + nh = read.get_tag('NH') + if nh <= maxmap: + filteredreadcount += 1 + outfh.write(read) + + #Remove unfiltered bam and its index + os.remove(inbam) + os.remove(inbam + '.bai') + #Rename filtered bam so that it has the same name as the original + #This helps later when pigpen is looking for bams with certain expected names + os.rename(outbam, inbam) + #index filtered bam + bamindex = inbam + '.bai' + indexCMD = 'samtools index ' + inbam + index = subprocess.Popen(indexCMD, shell=True) + index.wait() + + filteredpct = round((filteredreadcount / readcount) * 100, 3) + + print('Looked through {0} reads. 
{1} ({2}%) had {3} or fewer alignments.'.format( + readcount, filteredreadcount, filteredpct, maxmap)) + + def runDedup(samplename, nthreads): STARbam = os.path.join(os.getcwd(), 'STAR', '{0}Aligned.sortedByCoord.out.bam'.format(samplename)) dedupbam = os.path.join(os.getcwd(), 'STAR', '{0}.dedup.bam'.format(samplename)) @@ -79,22 +121,25 @@ def runDedup(samplename, nthreads): subprocess.run(command) #We don't need the STAR alignment file anymore, and it's pretty big - os.remove(STARbam) + #Rename to the old name so downstream code finds the bams it's looking for + os.rename(dedupbam, STARbam) + #Reindex deduplicated bam + bamindex = STARbam + '.bai' + indexCMD = 'samtools index ' + STARbam + index = subprocess.Popen(indexCMD, shell=True) + index.wait() print('Finished deduplicating {0}!'.format(samplename)) -def bamtofastq(samplename, nthreads, dedup): +def bamtofastq(samplename, nthreads, dedup, reads2): #Given a bam file of uniquely aligned reads (produced from runSTAR), rederive these reads as fastq in preparation for submission to salmon if not os.path.exists('STAR'): os.mkdir('STAR') cwd = os.getcwd() outdir = os.path.join(cwd, 'STAR') - if dedup: - inbam = os.path.join(outdir, samplename + '.dedup.bam') - else: - inbam = os.path.join(outdir, samplename + 'Aligned.sortedByCoord.out.bam') + inbam = os.path.join(outdir, samplename + 'Aligned.sortedByCoord.out.bam') sortedbam = os.path.join(outdir, 'temp.namesort.bam') #First sort bam file by readname @@ -104,16 +149,19 @@ def bamtofastq(samplename, nthreads, dedup): print('Done!') #Now derive fastq + r1file = samplename + '.aligned.r1.fq.gz' + r2file = samplename + '.aligned.r2.fq.gz' if dedup: - r1file = samplename + '.dedup.r1.fq.gz' - r2file = samplename + '.dedup.r2.fq.gz' - else: - r1file = samplename + '.STARaligned.r1.fq.gz' - r2file = samplename + '.STARaligned.r2.fq.gz' - print('Writing fastq file of deduplicated reads for {0}...'.format(samplename)) - command = ['samtools', 'fastq', '--threads', 
nthreads, '-1', r1file, '-2', r2file, '-0', '/dev/null', '-s', '/dev/null', '-n', sortedbam] + print('Writing fastq file of deduplicated reads for {0}...'.format(samplename)) + elif not dedup: + print('Writing fastq file of aligned reads for {0}...'.format(samplename)) + if reads2: + command = ['samtools', 'fastq', '--threads', nthreads, '-1', r1file, '-2', r2file, '-0', '/dev/null', '-s', '/dev/null', '-n', sortedbam] + elif not reads2: + command = ['samtools', 'fastq', '--threads', nthreads, '-0', r1file, '-n', sortedbam] subprocess.call(command) print('Done writing fastq files for {0}!'.format(samplename)) + os.remove(sortedbam) @@ -144,7 +192,8 @@ def runSalmon(reads1, reads2, nthreads, salmonindex, samplename): #Remove uniquely aligning read files os.remove(r1) - os.remove(r2) + if reads2: + os.remove(r2) print('Finished salmon for {0}!'.format(samplename)) @@ -197,13 +246,20 @@ def addMD(samplename, reffasta, nthreads): parser.add_argument('--samplename', type = str, help = 'Sample name. Will be appended to output files.') parser.add_argument('--dedupUMI', action = 'store_true', help = 'Deduplicate UMIs? 
requires UMI extract.') parser.add_argument('--libType', type = str, help = 'Library Type, either "LEXO" or "SA"') + parser.add_argument( + '--maxmap', type=int, help='Maximum number of allowable alignments for a read.') args = parser.parse_args() r1 = os.path.abspath(args.forwardreads) - r2 = os.path.abspath(args.reversereads) + if args.reversereads: + r2 = os.path.abspath(args.reversereads) + elif not args.reversereads: + r2 = None STARindex = os.path.abspath(args.STARindex) + salmonindex = os.path.abspath(args.salmonindex) samplename = args.samplename nthreads = args.nthreads + maxmap = args.maxmap wd = os.path.abspath(os.getcwd()) sampledir = os.path.join(wd, samplename) @@ -214,20 +270,19 @@ def addMD(samplename, reffasta, nthreads): runSTAR(r1, r2, nthreads, STARindex, samplename) + filterbam(samplename, maxmap) if args.dedupUMI: runDedup(samplename, nthreads) - if args.libType == "LEXO": - salmonindex = os.path.abspath(args.salmonindex) - #uniquely aligning or deduplicated read files - if args.dedupUMI: - salmonR1 = samplename + '.dedup.r1.fq.gz' - salmonR2 = samplename + '.dedup.r2.fq.gz' - else: - salmonR1 = samplename + '.STARaligned.r1.fq.gz' - salmonR2 = samplename + '.STARaligned.r2.fq.gz' - - bamtofastq(samplename, nthreads, args.dedupUMI) - runSalmon(salmonR1, salmonR2, nthreads, salmonindex, samplename) - os.chdir(sampledir) - runPostmaster(samplename, nthreads) \ No newline at end of file + #aligned read files + alignedr1 = samplename + '.aligned.r1.fq.gz' + if args.reversereads: + alignedr2 = samplename + '.aligned.r2.fq.gz' + elif not args.reversereads: + alignedr2 = None + bamtofastq(samplename, nthreads, args.dedupUMI, r2) + runSalmon(alignedr1, alignedr2, nthreads, salmonindex, samplename) + #Remove aligned fastqs + os.chdir(sampledir) + os.remove(alignedr1) + if args.reversereads: + os.remove(alignedr2) + os.chdir(sampledir) + runPostmaster(samplename, nthreads) From a54a40365852153db65ca049881e6f507a0e9c54 Mon Sep 17 00:00:00 2001 From: Matthew Taliaferro Date: Fri, 23 Feb 2024 16:32:18 -0700 Subject: [PATCH 
065/108] fix typo in bacon_glm --- bacon_glm.py | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/bacon_glm.py b/bacon_glm.py index 033a522..78de2d6 100644 --- a/bacon_glm.py +++ b/bacon_glm.py @@ -391,7 +391,7 @@ def formatporcdf(porcdf): #What metric should we care about? if args.considernonG: - metric == 'porc' + metric = 'porc' elif args.use_g_t and not args.use_g_c: metric = 'G_Trate' elif args.use_g_c and not args.use_g_t: From 1cf6b8ee9c7589a420e5e4b3ea60c610bb9509ac Mon Sep 17 00:00:00 2001 From: Matthew Taliaferro Date: Mon, 26 Feb 2024 11:04:39 -0700 Subject: [PATCH 066/108] add ability to quantify deletions in query --- assignreads_salmon.py | 50 ++++--- conversionsPerGene.py | 49 ++++--- getmismatches.py | 319 ++++++++++++++++++++++++------------------ pigpen.py | 19 +-- 4 files changed, 258 insertions(+), 179 deletions(-) diff --git a/assignreads_salmon.py b/assignreads_salmon.py index 65f5c69..906b211 100644 --- a/assignreads_salmon.py +++ b/assignreads_salmon.py @@ -108,7 +108,8 @@ def collapsetogene(txconvs, gff): 'a_a', 'a_t', 'a_c', 'a_g', 'a_n', 'g_a', 'g_t', 'g_c', 'g_g', 'g_n', 'c_a', 'c_t', 'c_c', 'c_g', 'c_n', - 't_a', 't_t', 't_c', 't_g', 't_n'] + 't_a', 't_t', 't_c', 't_g', 't_n', + 'a_x', 'g_x', 'c_x', 't_x', 'ng_xg'] for gene in allgenes: geneconvs[gene] = {} @@ -161,22 +162,23 @@ def readspergene(quantsf, tx2gene): return genecounts -def writeOutput(sampleparams, geneconvs, genecounts, geneid2genename, outfile, use_g_t, use_g_c): +def writeOutput(sampleparams, geneconvs, genecounts, geneid2genename, outfile, use_g_t, use_g_c, use_g_x, use_ng_xg): #Write number of conversions and readcounts for genes. 
possibleconvs = [ 'a_a', 'a_t', 'a_c', 'a_g', 'a_n', 'g_a', 'g_t', 'g_c', 'g_g', 'g_n', 'c_a', 'c_t', 'c_c', 'c_g', 'c_n', - 't_a', 't_t', 't_c', 't_g', 't_n'] + 't_a', 't_t', 't_c', 't_g', 't_n', + 'a_x', 'g_x', 'c_x', 't_x', 'ng_xg'] with open(outfile, 'w') as outfh: #Write arguments for this pigpen run for arg in sampleparams: outfh.write('#' + arg + '\t' + str(sampleparams[arg]) + '\n') #total G is number of ref Gs encountered - #convG is g_t + g_c (the ones we are interested in) + #convG is g_t + g_c + g_x + ng_xg (the ones we are interested in) outfh.write(('\t').join(['GeneID', 'GeneName', 'numreads'] + possibleconvs + [ - 'totalG', 'convG', 'convGrate', 'G_Trate', 'G_Crate', 'porc']) + '\n') + 'totalG', 'convG', 'convGrate', 'G_Trate', 'G_Crate', 'G_Xrate', 'NG_XGrate', 'porc']) + '\n') genes = sorted(geneconvs.keys()) for gene in genes: @@ -190,21 +192,19 @@ def writeOutput(sampleparams, geneconvs, genecounts, geneid2genename, outfile, u convcounts = ['{:.2f}'.format(x) for x in convcounts] - totalG = c['g_g'] + c['g_c'] + c['g_t'] + c['g_a'] + c['g_n'] - if use_g_t and use_g_c: - convG = c['g_c'] + c['g_t'] - elif use_g_c and not use_g_t: - convG = c['g_c'] - elif use_g_t and not use_g_c: - convG = c['g_t'] - elif not use_g_t and not use_g_c: - print('ERROR: we have to be counting either G->T or G->C, if not both!') - sys.exit() - + totalG = c['g_g'] + c['g_c'] + c['g_t'] + c['g_a'] + c['g_n'] + c['g_x'] + convG = 0 + possiblegconv = ['g_t', 'g_c', 'g_x', 'ng_xg'] + for ind, x in enumerate([use_g_t, use_g_c, use_g_x, use_ng_xg]): + if x == True: + convG += c[possiblegconv[ind]] + g_ccount = c['g_c'] g_tcount = c['g_t'] + g_xcount = c['g_x'] + ng_xgcount = c['ng_xg'] - totalmut = c['a_t'] + c['a_c'] + c['a_g'] + c['g_t'] + c['g_c'] + c['g_a'] + c['t_a'] + c['t_c'] + c['t_g'] + c['c_t'] + c['c_g'] + c['c_a'] + totalmut = c['a_t'] + c['a_c'] + c['a_g'] + c['g_t'] + c['g_c'] + c['g_a'] + c['t_a'] + c['t_c'] + c['t_g'] + c['c_t'] + c['c_g'] + c['c_a'] + 
c['g_x'] + c['ng_xg'] totalnonmut = c['a_a'] + c['g_g'] + c['c_c'] + c['t_t'] allnt = totalmut + totalnonmut @@ -223,6 +223,16 @@ def writeOutput(sampleparams, geneconvs, genecounts, geneid2genename, outfile, u except ZeroDivisionError: g_trate = 'NA' + try: + g_xrate = g_xcount / totalG + except ZeroDivisionError: + g_xrate = 'NA' + + try: + ng_xgrate = ng_xgcount / totalG + except ZeroDivisionError: + ng_xgrate = 'NA' + try: totalmutrate = totalmut / allnt except ZeroDivisionError: @@ -253,10 +263,14 @@ def writeOutput(sampleparams, geneconvs, genecounts, geneid2genename, outfile, u g_trate = '{:.2e}'.format(g_trate) if type(g_crate) == float: g_crate = '{:.2e}'.format(g_crate) + if type(g_xrate) == float: + g_xrate = '{:.2e}'.format(g_xrate) + if type(ng_xgrate) == float: + ng_xgrate = '{:.2e}'.format(ng_xgrate) if type(porc) == np.float64: porc = '{:.3f}'.format(porc) - outfh.write(('\t').join([gene, genename, str(numreads)] + convcounts + [str(totalG), str(convG), str(convGrate), str(g_trate), str(g_crate), str(porc)]) + '\n') + outfh.write(('\t').join([gene, genename, str(numreads)] + convcounts + [str(totalG), str(convG), str(convGrate), str(g_trate), str(g_crate), str(g_xrate), str(ng_xgrate), str(porc)]) + '\n') diff --git a/conversionsPerGene.py b/conversionsPerGene.py index 21bce0c..52627d1 100644 --- a/conversionsPerGene.py +++ b/conversionsPerGene.py @@ -24,7 +24,8 @@ def getPerGene(convs, reads2gene): 'a_a', 'a_t', 'a_c', 'a_g', 'a_n', 'g_a', 'g_t', 'g_c', 'g_g', 'g_n', 'c_a', 'c_t', 'c_c', 'c_g', 'c_n', - 't_a', 't_t', 't_c', 't_g', 't_n'] + 't_a', 't_t', 't_c', 't_g', 't_n', + 'a_x', 'g_x', 'c_x', 't_x', 'ng_xg'] #It's possible (but relatively rare) for a read to be in convs but #not in reads2gene (or vice versa). Filter for reads only present in both. 
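The hunk above repeats the same try/except pattern for each new rate column (`G_Xrate`, `NG_XGrate`): divide, and fall back to the string `'NA'` when no reference Gs were seen. The pattern can be factored as a small helper; `safe_rate` is an illustrative name, not a function in the repository.

```python
def safe_rate(count, total):
    """Sketch of the repeated try/except in this hunk: return count/total,
    or the string 'NA' when the denominator is zero."""
    try:
        return count / total
    except ZeroDivisionError:
        return 'NA'

# Mirrors the g_xrate / ng_xgrate computation when totalG == 0.
print(safe_rate(3, 0))    # NA
print(safe_rate(2, 100))  # 0.02
```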
@@ -36,7 +37,6 @@ def getPerGene(convs, reads2gene): convs = {key:value for (key, value) in convs.items() if key in commonreads} reads2gene = {key:value for (key, value) in reads2gene.items() if key in commonreads} - for gene in geneids: convsPerGene[gene] = {} for conv in possibleconvs: @@ -51,20 +51,21 @@ def getPerGene(convs, reads2gene): return numreadspergene, convsPerGene -def writeConvsPerGene(sampleparams, numreadspergene, convsPerGene, outfile, use_g_t, use_g_c): +def writeConvsPerGene(sampleparams, numreadspergene, convsPerGene, outfile, use_g_t, use_g_c, use_g_x, use_ng_xg): possibleconvs = [ 'a_a', 'a_t', 'a_c', 'a_g', 'a_n', 'g_a', 'g_t', 'g_c', 'g_g', 'g_n', 'c_a', 'c_t', 'c_c', 'c_g', 'c_n', - 't_a', 't_t', 't_c', 't_g', 't_n'] + 't_a', 't_t', 't_c', 't_g', 't_n', + 'a_x', 'g_x', 'c_x', 't_x', 'ng_xg'] with open(outfile, 'w') as outfh: #Write arguments for this pigpen run for arg in sampleparams: outfh.write('#' + arg + '\t' + str(sampleparams[arg]) + '\n') #total G is number of ref Gs encountered - #convG is g_t + g_c (the ones we are interested in) - outfh.write(('\t').join(['Gene', 'numreads'] + possibleconvs + ['totalG', 'convG', 'convGrate', 'G_Trate', 'G_Crate', 'porc']) + '\n') + #convG is g_t + g_c + g_x + ng_xg (the ones we are interested in) + outfh.write(('\t').join(['Gene', 'numreads'] + possibleconvs + ['totalG', 'convG', 'convGrate', 'G_Trate', 'G_Crate', 'G_Xrate', 'NG_XGrate', 'porc']) + '\n') genes = sorted(convsPerGene.keys()) for gene in genes: @@ -77,21 +78,19 @@ def writeConvsPerGene(sampleparams, numreadspergene, convsPerGene, outfile, use_ convcounts = [str(x) for x in convcounts] - totalG = c['g_g'] + c['g_c'] + c['g_t'] + c['g_a'] + c['g_n'] - if use_g_t and use_g_c: - convG = c['g_c'] + c['g_t'] - elif use_g_c and not use_g_t: - convG = c['g_c'] - elif use_g_t and not use_g_c: - convG = c['g_t'] - elif not use_g_t and not use_g_c: - print('ERROR: we have to be counting either G->T or G->C, if not both!') - sys.exit() + 
totalG = c['g_g'] + c['g_c'] + c['g_t'] + c['g_a'] + c['g_n'] + c['g_x'] + convG = 0 + possiblegconv = ['g_t', 'g_c', 'g_x', 'ng_xg'] + for ind, x in enumerate([use_g_t, use_g_c, use_g_x, use_ng_xg]): + if x == True: + convG += c[possiblegconv[ind]] g_ccount = c['g_c'] g_tcount = c['g_t'] + g_xcount = c['g_x'] + ng_xgcount = c['ng_xg'] - totalmut = c['a_t'] + c['a_c'] + c['a_g'] + c['g_t'] + c['g_c'] + c['g_a'] + c['t_a'] + c['t_c'] + c['t_g'] + c['c_t'] + c['c_g'] + c['c_a'] + totalmut = c['a_t'] + c['a_c'] + c['a_g'] + c['g_t'] + c['g_c'] + c['g_a'] + c['t_a'] + c['t_c'] + c['t_g'] + c['c_t'] + c['c_g'] + c['c_a'] + c['g_x'] + c['ng_xg'] totalnonmut = c['a_a'] + c['g_g'] + c['c_c'] + c['t_t'] allnt = totalmut + totalnonmut @@ -110,6 +109,16 @@ def writeConvsPerGene(sampleparams, numreadspergene, convsPerGene, outfile, use_ except ZeroDivisionError: g_trate = 'NA' + try: + g_xrate = g_xcount / totalG + except ZeroDivisionError: + g_xrate = 'NA' + + try: + ng_xgrate = ng_xgcount / totalG + except ZeroDivisionError: + ng_xgrate = 'NA' + try: totalmutrate = totalmut / allnt except ZeroDivisionError: @@ -134,10 +143,14 @@ def writeConvsPerGene(sampleparams, numreadspergene, convsPerGene, outfile, use_ g_trate = '{:.2e}'.format(g_trate) if type(g_crate) == float: g_crate = '{:.2e}'.format(g_crate) + if type(g_xrate) == float: + g_xrate = '{:.2e}'.format(g_xrate) + if type(ng_xgrate) == float: + ng_xgrate = '{:.2e}'.format(ng_xgrate) if type(porc) == np.float64: porc = '{:.3f}'.format(porc) - outfh.write(('\t').join([gene, str(numreads)] + convcounts + [str(totalG), str(convG), str(convGrate), str(g_trate), str(g_crate), str(porc)]) + '\n') + outfh.write(('\t').join([gene, str(numreads)] + convcounts + [str(totalG), str(convG), str(convGrate), str(g_trate), str(g_crate), str(g_xrate), str(ng_xgrate), str(porc)]) + '\n') if __name__ == '__main__': diff --git a/getmismatches.py b/getmismatches.py index a1428a6..707dd38 100644 --- a/getmismatches.py +++ b/getmismatches.py 
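The refactored `convG` computation above replaces an if/elif ladder over `use_g_t`/`use_g_c` with a loop over flag/conversion pairs, which scales to the two new deletion classes. A sketch of the same idea (`count_convG` is an illustrative name):

```python
def count_convG(c, use_g_t, use_g_c, use_g_x, use_ng_xg):
    """Sum only the G-conversion classes whose --use_* flag is set,
    mirroring the loop over possiblegconv in this hunk."""
    possiblegconv = ['g_t', 'g_c', 'g_x', 'ng_xg']
    flags = [use_g_t, use_g_c, use_g_x, use_ng_xg]
    return sum(c[conv] for conv, flag in zip(possiblegconv, flags) if flag)

counts = {'g_t': 5, 'g_c': 3, 'g_x': 2, 'ng_xg': 1}
print(count_convG(counts, True, True, False, False))  # 8
print(count_convG(counts, True, True, True, True))    # 11
```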
@@ -22,7 +22,10 @@ def revcomp(nt): 'c' : 'g', 'a' : 't', 't' : 'a', - 'n' : 'n' + 'n' : 'n', + 'X': 'X', + None: None, + 'NA': 'NA' } nt_rc = revcompdict[nt] @@ -207,7 +210,7 @@ def findsnps(controlbams, genomefasta, minCoverage = 20, minVarFreq = 0.02): return snps -def iteratereads_pairedend(bam, onlyConsiderOverlap, use_g_t, use_g_c, use_read1, use_read2, nConv, minMappingQual, snps=None, maskpositions=None, verbosity='high'): +def iteratereads_pairedend(bam, onlyConsiderOverlap, use_g_t, use_g_c, use_g_x, use_ng_xg, use_read1, use_read2, nConv, minMappingQual, minPhred=30, snps=None, maskpositions=None, verbosity='high'): #Iterate over reads in a paired end alignment file. #Find nt conversion locations for each read. #For locations interrogated by both mates of read pair, conversion must exist in both mates in order to count @@ -236,7 +239,7 @@ def iteratereads_pairedend(bam, onlyConsiderOverlap, use_g_t, use_g_c, use_read1 #Check mapping quality #MapQ is 255 for uniquely aligned reads FOR STAR ONLY - if read1.mapping_quality < minMappingQual or read2.mapping_quality < minMappingQual: + if read1.mapping_quality < int(minMappingQual) or read2.mapping_quality < int(minMappingQual): continue readcounter +=1 @@ -292,7 +295,7 @@ def iteratereads_pairedend(bam, onlyConsiderOverlap, use_g_t, use_g_c, use_read1 read1qualities = list(read1.query_qualities) #phred scores read2qualities = list(read2.query_qualities) - convs_in_read = getmismatches_pairedend(read1alignedpairs, read2alignedpairs, read1queryseq, read2queryseq, read1qualities, read2qualities, read1strand, read2strand, masklocations, onlyConsiderOverlap, nConv, use_g_t, use_g_c, use_read1, use_read2) + convs_in_read = getmismatches_pairedend(read1alignedpairs, read2alignedpairs, read1queryseq, read2queryseq, read1qualities, read2qualities, read1strand, read2strand, masklocations, onlyConsiderOverlap, nConv, minPhred, use_g_t, use_g_c, use_g_x, use_ng_xg, use_read1, use_read2) 
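The extended `revcomp` above adds identity mappings for the deletion placeholder `'X'`, `None`, and `'NA'` so that minus-strand handling of merged pairs does not raise `KeyError`. A self-contained sketch of the lookup (the full dictionary in the repository also covers other cases not shown in the hunk):

```python
def revcomp(nt):
    """Sketch of the extended revcomp: base-pair complements, plus
    self-mapping entries for the deletion placeholder 'X', None, and 'NA'."""
    revcompdict = {
        'G': 'C', 'C': 'G', 'A': 'T', 'T': 'A', 'N': 'N',
        'g': 'c', 'c': 'g', 'a': 't', 't': 'a', 'n': 'n',
        'X': 'X', None: None, 'NA': 'NA',
    }
    return revcompdict[nt]

print(revcomp('g'))  # c
print(revcomp('X'))  # X
```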
queriednts.append(sum(convs_in_read.values())) convs[queryname] = convs_in_read @@ -307,35 +310,43 @@ def iteratereads_pairedend(bam, onlyConsiderOverlap, use_g_t, use_g_c, use_read1 return convs, readcounter -def getmismatches_pairedend(read1alignedpairs, read2alignedpairs, read1queryseq, read2queryseq, read1qualities, read2qualities, read1strand, read2strand, masklocations, onlyoverlap, nConv, use_g_t, use_g_c, use_read1, use_read2): +def getmismatches_pairedend(read1alignedpairs, read2alignedpairs, read1queryseq, read2queryseq, read1qualities, read2qualities, read1strand, read2strand, masklocations, onlyoverlap, nConv, minPhred, use_g_t, use_g_c, use_g_x, use_ng_xg, use_read1, use_read2): #remove tuples that have None #These are either intronic or might have been soft-clipped #Tuples are (querypos, refpos, refsequence) #If there is a substitution, refsequence is lower case - #remove positions where querypos is None - #i'm pretty sure these query positions won't have quality scores - read1alignedpairs = [x for x in read1alignedpairs if x[0] != None] - read2alignedpairs = [x for x in read2alignedpairs if x[0] != None] - - #Add quality scores to alignedpairs tuples - #will now be (querypos, refpos, refsequence, qualityscore) - read1ap_withq = [] - for ind, x in enumerate(read1alignedpairs): - x += (read1qualities[ind],) - read1ap_withq.append(x) - read1alignedpairs = read1ap_withq - - read2ap_withq = [] - for ind, x in enumerate(read2alignedpairs): - x += (read2qualities[ind],) - read2ap_withq.append(x) - read2alignedpairs = read2ap_withq - - #Now remove positions where refsequence is None - #These may be places that got soft-clipped - read1alignedpairs = [x for x in read1alignedpairs if None not in x] - read2alignedpairs = [x for x in read2alignedpairs if None not in x] + #For now, forget insertions. Get rid of any position where reference position is None. 
+ read1alignedpairs = [x for x in read1alignedpairs if x[1] != None] + read2alignedpairs = [x for x in read2alignedpairs if x[1] != None] + read1alignedpairs = [x for x in read1alignedpairs if x[2] != None] + read2alignedpairs = [x for x in read2alignedpairs if x[2] != None] + + #Add quality scores and query sequences + #will now be (querypos, refpos, refsequence, querysequence, qualscore) + for x in range(len(read1alignedpairs)): + alignedpair = read1alignedpairs[x] + querypos = alignedpair[0] + if querypos != None: + querynt = read1queryseq[querypos] + qualscore = read1qualities[querypos] + elif querypos == None: + querynt = 'X' # there is no query nt for a deletion + qualscore = 37 # there's no query position here, so make up a quality score + alignedpair = alignedpair + (querynt, qualscore) + read1alignedpairs[x] = alignedpair + + for x in range(len(read2alignedpairs)): + alignedpair = read2alignedpairs[x] + querypos = alignedpair[0] + if querypos != None: + querynt = read2queryseq[querypos] + qualscore = read2qualities[querypos] + elif querypos == None: + querynt = 'X' + qualscore = 37 + alignedpair = alignedpair + (querynt, qualscore) + read2alignedpairs[x] = alignedpair #if we have locations to mask, remove their locations from read1alignedpairs and read2alignedpairs #masklocations is a set of 0-based coordinates of snp locations to mask @@ -349,7 +360,8 @@ def getmismatches_pairedend(read1alignedpairs, read2alignedpairs, read1queryseq, 'a_a', 'a_t', 'a_c', 'a_g', 'a_n', 'g_a', 'g_t', 'g_c', 'g_g', 'g_n', 'c_a', 'c_t', 'c_c', 'c_g', 'c_n', - 't_a', 't_t', 't_c', 't_g', 't_n'] + 't_a', 't_t', 't_c', 't_g', 't_n', + 'a_x', 'g_x', 'c_x', 't_x', 'ng_xg'] #initialize dictionary for conv in possibleconvs: @@ -359,48 +371,57 @@ def getmismatches_pairedend(read1alignedpairs, read2alignedpairs, read1queryseq, #These locations (as defined by their reference positions) would be found both in read1alignedpairs and read2alignedpairs #Get the ref positions queried by the 
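The hunk above drops aligned pairs whose reference position or reference base is `None` (insertions and soft-clips), then extends each remaining tuple with the query base and phred score, substituting `'X'` and an invented quality of 37 when the query position is `None` (a deletion). A simplified sketch of that annotation step (`annotate_pairs` is an illustrative name):

```python
def annotate_pairs(aligned_pairs, queryseq, qualities):
    """Extend each (querypos, refpos, refseq) tuple with the query base and
    phred score; a deletion (querypos None) gets the placeholder 'X' and a
    made-up quality of 37 so it can pass a minPhred filter downstream."""
    out = []
    for querypos, refpos, refseq in aligned_pairs:
        if refpos is None or refseq is None:
            continue  # insertion or soft-clip: skip, as the patch does
        if querypos is not None:
            out.append((querypos, refpos, refseq,
                        queryseq[querypos], qualities[querypos]))
        else:
            out.append((None, refpos, refseq, 'X', 37))
    return out

pairs = [(0, 100, 'A'), (None, 101, 'g'), (1, 102, 'C'), (2, None, None)]
print(annotate_pairs(pairs, 'AC', [35, 36]))
# [(0, 100, 'A', 'A', 35), (None, 101, 'g', 'X', 37), (1, 102, 'C', 'C', 36)]
```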
two reads - r1dict = {} #{reference position : [queryposition, reference sequence, quality]} + r1dict = {} #{reference position: [queryposition, reference sequence, querysequence, quality]} r2dict = {} for x in read1alignedpairs: - r1dict[int(x[1])] = [x[0], x[2], x[3]] + r1dict[int(x[1])] = [x[0], x[2].upper(), x[3].upper(), x[4]] for x in read2alignedpairs: - r2dict[int(x[1])] = [x[0], x[2], x[3]] + r2dict[int(x[1])] = [x[0], x[2].upper(), x[3].upper(), x[4]] - mergedalignedpairs = {} # {refpos : [R1querypos, R2querypos, R1refsequence, R2refsequence, R1quality, R2quality]} + # {refpos : [R1querypos, R2querypos, R1refsequence, R2refsequence, R1querysequence, R2querysequence, R1quality, R2quality]} + mergedalignedpairs = {} #For positions only in R1 or R2, querypos and refsequence are NA for the other read for refpos in r1dict: r1querypos = r1dict[refpos][0] r1refseq = r1dict[refpos][1] - r1quality = r1dict[refpos][2] + r1queryseq = r1dict[refpos][2] + r1quality = r1dict[refpos][3] if refpos in mergedalignedpairs: #this should not be possible because we are looking at r1 first r2querypos = mergedalignedpairs[refpos][1] r2refseq = mergedalignedpairs[refpos][3] - r2quality = mergedalignedpairs[refpos][5] - mergedalignedpairs[refpos] = [r1querypos, r2querypos, r1refseq, r2refseq, r1quality, r2quality] + r2queryseq = mergedalignedpairs[refpos][5] + r2quality = mergedalignedpairs[refpos][7] + mergedalignedpairs[refpos] = [r1querypos, r2querypos, r1refseq, + r2refseq, r1queryseq, r2queryseq, r1quality, r2quality] else: - mergedalignedpairs[refpos] = [r1querypos, 'NA', r1refseq, 'NA', r1quality, 'NA'] + mergedalignedpairs[refpos] = [r1querypos, 'NA', r1refseq, 'NA', r1queryseq, 'NA', r1quality, 'NA'] for refpos in r2dict: #same thing r2querypos = r2dict[refpos][0] r2refseq = r2dict[refpos][1] - r2quality = r2dict[refpos][2] + r2queryseq = r2dict[refpos][2] + r2quality = r2dict[refpos][3] if refpos in mergedalignedpairs: #if we saw it for r1 r1querypos = 
mergedalignedpairs[refpos][0] r1refseq = mergedalignedpairs[refpos][2] - r1quality = mergedalignedpairs[refpos][4] - mergedalignedpairs[refpos] = [r1querypos, r2querypos, r1refseq, r2refseq, r1quality, r2quality] + r1queryseq = mergedalignedpairs[refpos][4] + r1quality = mergedalignedpairs[refpos][6] + mergedalignedpairs[refpos] = [r1querypos, r2querypos, r1refseq, + r2refseq, r1queryseq, r2queryseq, r1quality, r2quality] else: - mergedalignedpairs[refpos] = ['NA', r2querypos, 'NA', r2refseq, 'NA', r2quality] + mergedalignedpairs[refpos] = ['NA', r2querypos, 'NA', r2refseq, 'NA', r2queryseq, 'NA', r2quality] #If we are only using read1 or only using read2, replace the positions in the non-used read with NA for refpos in mergedalignedpairs: - r1querypos, r2querypos, r1refseq, r2refseq, r1quality, r2quality = mergedalignedpairs[refpos] + r1querypos, r2querypos, r1refseq, r2refseq, r1queryseq, r2queryseq, r1quality, r2quality = mergedalignedpairs[refpos] if use_read1 and not use_read2: - updatedlist = [r1querypos, 'NA', r1refseq, 'NA', r1quality, 'NA'] + updatedlist = [r1querypos, 'NA', r1refseq, + 'NA', r1queryseq, 'NA', r1quality, 'NA'] mergedalignedpairs[refpos] = updatedlist elif use_read2 and not use_read1: - updatedlist = ['NA', r2querypos, 'NA', r2refseq, 'NA', r2quality] + updatedlist = ['NA', r2querypos, 'NA', + r2refseq, 'NA', r2queryseq, 'NA', r2quality] mergedalignedpairs[refpos] = updatedlist elif not use_read1 and not use_read2: print('ERROR: we have to use either read1 or read2, if not both.') @@ -408,128 +429,152 @@ def getmismatches_pairedend(read1alignedpairs, read2alignedpairs, read1queryseq, elif use_read1 and use_read2: pass - #Now go through mergedalignedpairs, looking for conversions. - #For positions observed both in r1 and r2, a conversion must be present in both reads, - #otherwise it will be recorded as not having a conversion. + #For positions observed both in r1 and r2, queryseq in both reads must match, otherwise the position is skipped. 
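The merging logic above combines the two mates' aligned pairs into one dictionary keyed by reference position, filling `'NA'` for fields the other mate did not query. A simplified sketch with the same eight-slot layout (`merge_aligned_pairs` is an illustrative name):

```python
def merge_aligned_pairs(r1dict, r2dict):
    """Merge per-mate dicts {refpos: [querypos, refseq, queryseq, quality]}
    into {refpos: [r1querypos, r2querypos, r1refseq, r2refseq,
    r1queryseq, r2queryseq, r1quality, r2quality]}, 'NA' where absent."""
    merged = {}
    for refpos, (qpos, ref, query, qual) in r1dict.items():
        merged[refpos] = [qpos, 'NA', ref, 'NA', query, 'NA', qual, 'NA']
    for refpos, (qpos, ref, query, qual) in r2dict.items():
        if refpos in merged:  # position seen by both mates
            m = merged[refpos]
            m[1], m[3], m[5], m[7] = qpos, ref, query, qual
        else:
            merged[refpos] = ['NA', qpos, 'NA', ref, 'NA', query, 'NA', qual]
    return merged

r1 = {100: [0, 'G', 'T', 36]}
r2 = {100: [10, 'G', 'T', 38], 101: [11, 'A', 'A', 40]}
m = merge_aligned_pairs(r1, r2)
print(m[100])  # [0, 10, 'G', 'G', 'T', 'T', 36, 38]
print(m[101])  # ['NA', 11, 'NA', 'A', 'NA', 'A', 'NA', 40]
```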
+ #We are now keeping track of deletions as either g_x (reference G, query deletion) or ng_xg (ref nt 5' of G deleted in query) + #We have observed that sometimes RT skips the nucleotide *after* an oxidized G ("after" being from the RT's point of view) for refpos in mergedalignedpairs: + conv = None + conv2 = None #sometimes we can have 2 convs (for example the first nt of ng_xg could be both g_x and ng_xg) r1querypos = mergedalignedpairs[refpos][0] r2querypos = mergedalignedpairs[refpos][1] r1refseq = mergedalignedpairs[refpos][2] r2refseq = mergedalignedpairs[refpos][3] - r1quality = mergedalignedpairs[refpos][4] - r2quality = mergedalignedpairs[refpos][5] + r1queryseq = mergedalignedpairs[refpos][4] + r2queryseq = mergedalignedpairs[refpos][5] + r1quality = mergedalignedpairs[refpos][6] + r2quality = mergedalignedpairs[refpos][7] if r1querypos != 'NA' and r2querypos == 'NA': #this position queried by r1 only if read1strand == '-': #refseq needs to equal the sense strand (it is always initially defined as the + strand). read1 is always the sense strand. 
r1refseq = revcomp(r1refseq) + r1queryseq = revcomp(r1queryseq) #If reference is N, skip this position if r1refseq == 'N' or r1refseq == 'n': continue - - if r1refseq.isupper(): #not a conversion - conv = r1refseq.lower() + '_' + r1refseq.lower() - elif r1refseq.islower(): #is a conversion - querynt = read1queryseq[r1querypos] - if read1strand == '-': - querynt = revcomp(querynt) - conv = r1refseq.lower() + '_' + querynt.lower() - - if r1quality >= 30 and onlyoverlap == False: - convs[conv] +=1 - else: - pass + conv = r1refseq.lower() + '_' + r1queryseq.lower() + + if r1queryseq == 'X': + #Check if there is a reference G downstream of this position + if read1strand == '+': + downstreamrefpos = refpos + 1 + downstreamrefseq = mergedalignedpairs[downstreamrefpos][2].upper() + downstreamqueryseq = mergedalignedpairs[downstreamrefpos][4].upper() + elif read1strand == '-': + downstreamrefpos = refpos - 1 + downstreamrefseq = mergedalignedpairs[downstreamrefpos][2].upper() + downstreamqueryseq = mergedalignedpairs[downstreamrefpos][4].upper() + downstreamrefseq = revcomp(downstreamrefseq) + downstreamqueryseq = revcomp(downstreamqueryseq) + if downstreamrefseq == 'G': + conv2 = 'ng_xg' + + #Add conv(s) to dictionary + if r1quality >= minPhred and onlyoverlap == False: + # there will be some conversions (e.g. a_x that are not in convs) + if conv in convs: + convs[conv] +=1 + if conv2 == 'ng_xg' and conv != 'g_x': + convs[conv2] +=1 elif r1querypos == 'NA' and r2querypos != 'NA': #this position is queried by r2 only if read1strand == '-': - r2refseq = revcomp(r2refseq) #reference seq is independent of which read we are talking about + # reference seq is independent of which read we are talking about + #Read1 is always the sense strand. r1queryseq and r2queryseq are always + strand + #The reference sequence is always on the + strand. + #If read1 is on the - strand, we have already flipped reference seq (see a few lines above). 
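The `ng_xg` check above looks one reference position downstream of a deletion (strand-aware) for a reference G. A deliberately simplified, plus-strand-only sketch of that test (`is_ng_xg` is an illustrative name; the real code also handles the minus strand via `revcomp` and tracks the second label `conv2`):

```python
def is_ng_xg(plus_strand_ref, delpos):
    """Plus-strand-only sketch: a query deletion whose downstream reference
    base is G gets the extra ng_xg label (RT skipping the nt next to an
    oxidized G)."""
    return plus_strand_ref[delpos + 1].upper() == 'G'

print(is_ng_xg('ATCAG', 3))  # True: deleted ref A sits 5' of a G
print(is_ng_xg('ATCAT', 3))  # False
```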
+ #We need to flip read2queryseq so that it is also - strand. + r2refseq = revcomp(r2refseq) + r2queryseq = revcomp(r2queryseq) if r2refseq == 'N' or r2refseq == 'n': continue - if r2refseq.isupper(): #not a conversion - conv = r2refseq.lower() + '_' + r2refseq.lower() - elif r2refseq.islower(): #is a conversion - querynt = read2queryseq[r2querypos] - if read1strand == '-': - #Read1 is always the sense strand. r1queryseq and r2queryseq are always + strand - #The reference sequence is always on the + strand. - #If read1 is on the - strand, we have already flipped reference seq (see a few lines above). - #We need to flip read2queryseq so that it is also - strand. - querynt = revcomp(querynt) - conv = r2refseq.lower() + '_' + querynt.lower() - - if r2quality >= 30 and onlyoverlap == False: - convs[conv] +=1 - else: - pass + conv = r2refseq.lower() + '_' + r2refseq.lower() + if r2queryseq == 'X': + #Check if there is a reference G downstream of this position + if read1strand == '+': + downstreamrefpos = refpos + 1 + downstreamrefseq = mergedalignedpairs[downstreamrefpos][2].upper() + downstreamqueryseq = mergedalignedpairs[downstreamrefpos][4].upper() + elif read1strand == '-': + downstreamrefpos = refpos - 1 + downstreamrefseq = mergedalignedpairs[downstreamrefpos][2].upper() + downstreamqueryseq = mergedalignedpairs[downstreamrefpos][4].upper() + downstreamrefseq = revcomp(downstreamrefseq) + downstreamqueryseq = revcomp(downstreamqueryseq) + if downstreamrefseq == 'G': + conv2 = 'ng_xg' + + #Add conv(s) to dictionary + if r2quality >= minPhred and onlyoverlap == False: + if conv in convs: + convs[conv] +=1 + if conv2 == 'ng_xg' and conv != 'g_x': + convs[conv2] +=1 elif r1querypos != 'NA' and r2querypos != 'NA': #this position is queried by both reads if read1strand == '-': r1refseq = revcomp(r1refseq) r2refseq = revcomp(r2refseq) + r1queryseq = revcomp(r1queryseq) + r2queryseq = revcomp(r2queryseq) if r1refseq == 'N' or r2refseq == 'N' or r1refseq == 'n' or 
r2refseq == 'n': continue - - #If the position is not high quality in both r1 and r2, skip it - if r1quality < 30 and r2quality < 30: - continue - - if r1refseq.isupper() and r2refseq.isupper(): #both reads agree it is not a conversion - conv = r1refseq.lower() + '_' + r1refseq.lower() - convs[conv] += 1 - - elif r1refseq.isupper() and r2refseq.islower(): #r1 says no conversion, r2 says conversion, so we say no conversion - conv = r1refseq.lower() + '_' + r1refseq.lower() - convs[conv] += 1 - - elif r1refseq.islower() and r2refseq.isupper(): #r1 says conversion, r2 says no conversion, so we say no conversion - conv = r2refseq.lower() + '_' + r2refseq.lower() - convs[conv] += 1 - - elif r1refseq.islower() and r2refseq.islower(): #both reads say there was a conversion - r1querynt = read1queryseq[r1querypos] - r2querynt = read2queryseq[r2querypos] - if read1strand == '-': - r1querynt = revcomp(r1querynt) - r2querynt = revcomp(r2querynt) - - #If the query nts don't match, skip this position - if r1querynt == r2querynt: - conv = r1refseq.lower() + '_' + r1querynt.lower() - convs[conv] +=1 - else: - pass + + r1result = r1refseq.lower() + '_' + r1queryseq.lower() + r2result = r2refseq.lower() + '_' + r2queryseq.lower() + + #Only record if r1 and r2 agree about what is going on + if r1result == r2result: + conv = r1refseq.lower() + '_' + r1queryseq.lower() + if r1queryseq == 'X': + #Check if there is a reference G downstream of this position + if read1strand == '+': + downstreamrefpos = refpos + 1 + downstreamrefseq = mergedalignedpairs[downstreamrefpos][2].upper() + downstreamqueryseq = mergedalignedpairs[downstreamrefpos][4].upper() + elif read1strand == '-': + downstreamrefpos = refpos - 1 + downstreamrefseq = mergedalignedpairs[downstreamrefpos][2].upper() + downstreamqueryseq = mergedalignedpairs[downstreamrefpos][4].upper() + downstreamrefseq = revcomp(downstreamrefseq) + downstreamqueryseq = revcomp(downstreamqueryseq) + if downstreamrefseq == 'G': + conv2 = 'ng_xg' 
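For positions covered by both mates, the rewritten logic above builds a `ref_query` string for each mate and counts the position only when the two strings agree, which subsumes the old four-branch upper/lower-case comparison. A sketch of that agreement rule (`call_overlap` is an illustrative name):

```python
def call_overlap(r1refseq, r1queryseq, r2refseq, r2queryseq):
    """Both mates must report the same ref->query outcome at a position;
    otherwise the position is skipped (returns None)."""
    r1result = r1refseq.lower() + '_' + r1queryseq.lower()
    r2result = r2refseq.lower() + '_' + r2queryseq.lower()
    return r1result if r1result == r2result else None

print(call_overlap('G', 'T', 'G', 'T'))  # g_t  (mates agree: counted)
print(call_overlap('G', 'T', 'G', 'G'))  # None (mates disagree: skipped)
```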
+ + #Add conv(s) to dictionary + #Only do conv2 (ng_xg) if conv is not g_x + if r1quality >= minPhred and r2quality >= minPhred: + if conv in convs: + convs[conv] +=1 + if conv2 == 'ng_xg' and conv != 'g_x': + convs[conv2] +=1 elif r1querypos == 'NA' and r2querypos == 'NA': #if we are using only read1 or read2, it's possible for this position in both reads to be NA continue #Does the number of g_t and/or g_c conversions meet our threshold? - if use_g_t and use_g_c: - if convs['g_t'] + convs['g_c'] >= nConv: - pass - elif convs['g_t'] + convs['g_c'] < nConv: - convs['g_t'] = 0 - convs['g_c'] = 0 - elif use_g_t and not use_g_c: - if convs['g_t'] >= nConv: - pass - elif convs['g_t'] < nConv: - convs['g_t'] = 0 - convs['g_c'] = 0 - elif use_g_c and not use_g_t: - if convs['g_c'] >= nConv: - pass - elif convs['g_c'] < nConv: - convs['g_c'] = 0 - convs['g_t'] = 0 - elif not use_g_t and not use_g_c: - print('ERROR: we have to be looking for at least either G->T or G->C if not both!!') + allconvs = ['g_c', 'g_t', 'g_x', 'ng_xg'] + convoptions = [use_g_c, use_g_t, use_g_x, use_ng_xg] + selectedconvs = [] + for ind, x in enumerate(allconvs): + if convoptions[ind] == True: + selectedconvs.append(x) + if not selectedconvs: + print('ERROR: we must be looking for at least one conversion type.') sys.exit() + + nConv_in_read = 0 + for x in selectedconvs: + nConv_in_read += convs[x] + if nConv_in_read < nConv: + for x in allconvs: + convs[x] = 0 return convs @@ -558,6 +603,8 @@ def summarize_convs(convs, outfile): t_g = 0 t_c = 0 t_n = 0 + g_x = 0 + ng_xg = 0 for read in convs: conv_in_read = convs[read] @@ -573,11 +620,13 @@ def summarize_convs(convs, outfile): c_g += conv_in_read['c_g'] c_n += conv_in_read['c_n'] - g += (conv_in_read['g_a'] + conv_in_read['g_t'] + conv_in_read['g_c'] + conv_in_read['g_g'] + conv_in_read['g_n']) + g += (conv_in_read['g_a'] + conv_in_read['g_t'] + conv_in_read['g_c'] + conv_in_read['g_g'] + conv_in_read['g_n'] + conv_in_read['g_x']) g_t += 
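The threshold logic above zeroes all four G-conversion counters when the selected classes in a read pair sum to fewer than `nConv`. A sketch (`apply_nconv_threshold` is an illustrative name; the real code prints an error and calls `sys.exit()` rather than raising):

```python
def apply_nconv_threshold(convs, nConv, use_g_t, use_g_c, use_g_x, use_ng_xg):
    """If the selected G-conversion classes sum to fewer than nConv,
    zero out all four classes for this read pair."""
    allconvs = ['g_c', 'g_t', 'g_x', 'ng_xg']
    options = [use_g_c, use_g_t, use_g_x, use_ng_xg]
    selected = [conv for conv, flag in zip(allconvs, options) if flag]
    if not selected:
        raise ValueError('must look for at least one conversion type')
    if sum(convs[conv] for conv in selected) < nConv:
        for conv in allconvs:
            convs[conv] = 0
    return convs

c = {'g_c': 0, 'g_t': 1, 'g_x': 0, 'ng_xg': 0}
print(apply_nconv_threshold(dict(c), 2, True, True, False, False))
# {'g_c': 0, 'g_t': 0, 'g_x': 0, 'ng_xg': 0}  -- below threshold, zeroed
```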
conv_in_read['g_t'] g_a += conv_in_read['g_a'] g_c += conv_in_read['g_c'] g_n += conv_in_read['g_n'] + g_x += conv_in_read['g_x'] + ng_xg += conv_in_read['ng_xg'] t += (conv_in_read['t_a'] + conv_in_read['t_t'] + conv_in_read['t_c'] + conv_in_read['t_g'] + conv_in_read['t_n']) t_g += conv_in_read['t_g'] @@ -598,7 +647,7 @@ def summarize_convs(convs, outfile): 'Tcount', 'T_Acount', 'T_Ccount', 'T_Gcount', 'T_Ncount', 'A_Trate', 'A_Crate', 'A_Grate', 'A_Nrate', 'C_Trate', 'C_Arate', 'C_Grate', 'C_Nrate', - 'G_Trate', 'G_Crate', 'G_Arate', 'G_Nrate', + 'G_Trate', 'G_Crate', 'G_Arate', 'G_Nrate', 'G_Xrate', 'NG_XGrate', 'T_Arate', 'T_Crate', 'T_Grate', 'T_Nrate', 'totalnt', 'totalconv', 'totalerrorrate' ]) + '\n') @@ -609,7 +658,7 @@ def summarize_convs(convs, outfile): str(t), str(t_a), str(t_c), str(t_g), str(t_n), str(a_t / a), str(a_c / a), str(a_g / a), str(a_n / a), str(c_t / c), str(c_a / c), str(c_g / c), str(c_n / c), - str(g_t / g), str(g_c / g), str(g_a / g), str(g_n / g), + str(g_t / g), str(g_c / g), str(g_a / g), str(g_n / g), str(g_x / g), str(ng_xg / g), str(t_a / t), str(t_c / t), str(t_g / t), str(t_n / t), str(totalnt), str(totalconv), str(totalerrorrate) ])) @@ -646,7 +695,7 @@ def split_bam(bam, nproc): return splitbams -def getmismatches(datatype, bam, onlyConsiderOverlap, snps, maskpositions, nConv, minMappingQual, nproc, use_g_t, use_g_c, use_read1, use_read2): +def getmismatches(datatype, bam, onlyConsiderOverlap, snps, maskpositions, nConv, minMappingQual, nproc, use_g_t, use_g_c, use_g_x, use_ng_xg, use_read1, use_read2, minPhred=30): #Actually run the mismatch code (calling iteratereads_pairedend) #use multiprocessing #If there's only one processor, easier to use iteratereads_pairedend() directly. 
@@ -658,7 +707,7 @@ def getmismatches(datatype, bam, onlyConsiderOverlap, snps, maskpositions, nConv for x in splitbams: if datatype == 'paired': argslist.append((x, bool(onlyConsiderOverlap), bool( - use_g_t), bool(use_g_c), bool(use_read1), bool(use_read2), nConv, minMappingQual, snps, maskpositions, 'low')) + use_g_t), bool(use_g_c), bool(use_g_x), bool(use_ng_xg), bool(use_read1), bool(use_read2), nConv, minMappingQual, minPhred, snps, maskpositions, 'low')) elif datatype == 'single': argslist.append((x, bool(use_g_t), bool(use_g_c), nConv, minMappingQual, snps, maskpositions, 'low')) @@ -696,7 +745,7 @@ def getmismatches(datatype, bam, onlyConsiderOverlap, snps, maskpositions, nConv if __name__ == '__main__': - convs, readcounter = iteratereads_singleend(sys.argv[1], True, True, 1, 255, None, None, 'high') + convs, readcounter = iteratereads_pairedend(sys.argv[1], True, True, True, True, True, True, True, 1, 255, 30, None, None, 'high') summarize_convs(convs, sys.argv[2]) diff --git a/pigpen.py b/pigpen.py index 9382311..ebcff5f 100644 --- a/pigpen.py +++ b/pigpen.py @@ -30,9 +30,12 @@ parser.add_argument('--onlyConsiderOverlap', action = 'store_true', help = 'Only consider conversions seen in both reads of a read pair? Only possible with paired end data.') parser.add_argument('--use_g_t', action = 'store_true', help = 'Consider G->T conversions?') parser.add_argument('--use_g_c', action = 'store_true', help = 'Consider G->C conversions?') + parser.add_argument('--use_g_x', action='store_true', help='Consider G->deletion conversions?') + parser.add_argument('--use_ng_xg', action='store_true', help='Consider NG->deletionG conversions?') parser.add_argument('--use_read1', action = 'store_true', help = 'Use read1 when looking for conversions? Only useful with paired end data.') parser.add_argument('--use_read2', action = 'store_true', help = 'Use read2 when looking for conversions? 
Only useful with paired end data.') parser.add_argument('--minMappingQual', type = int, help = 'Minimum mapping quality for a read to be considered in conversion counting. STAR unique mappers have MAPQ 255.', required = True) + parser.add_argument('--minPhred', type = int, help = 'Minimum phred quality score for a base to be considered. Default = 30', default = 30) parser.add_argument('--nConv', type = int, help = 'Minimum number of required G->T and/or G->C conversions in a read pair in order for those conversions to be counted. Default is 1.', default = 1) parser.add_argument('--outputDir', type = str, help = 'Output directory.', required = True) args = parser.parse_args() @@ -129,14 +132,14 @@ sampleparams['samplebam'] = os.path.abspath(samplebam) if args.nproc == 1: if args.datatype == 'paired': - convs, readcounter = iteratereads_pairedend(samplebam, args.onlyConsiderOverlap, args.use_g_t, args.use_g_c, - args.use_read1, args.use_read2, args.nConv, args.minMappingQual, snps, maskpositions, 'high') + convs, readcounter = iteratereads_pairedend(samplebam, args.onlyConsiderOverlap, args.use_g_t, args.use_g_c, args.use_g_x, args.use_ng_xg, + args.use_read1, args.use_read2, args.nConv, args.minMappingQual, args.minPhred, snps, maskpositions, 'high') elif args.datatype == 'single': convs, readcounter = iterratereads_singleend( samplebam, args.use_g_t, args.use_g_c, args.nConv, args.minMappingQual, snps, maskpostions, 'high') elif args.nproc > 1: convs = getmismatches(args.datatype, samplebam, args.onlyConsiderOverlap, snps, maskpositions, args.nConv, - args.minMappingQual, args.nproc, args.use_g_t, args.use_g_c, args.use_read1, args.use_read2) + args.minMappingQual, args.nproc, args.use_g_t, args.use_g_c, args.use_g_x, args.use_ng_xg, args.use_read1, args.use_read2, args.minPhred) print('Getting posterior probabilities from salmon alignment file...') postmasterbam = postmasterbams[ind] @@ -154,7 +157,7 @@ if not os.path.exists(args.outputDir): 
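The new `pigpen.py` options behave as standard argparse flags: `store_true` arguments default to `False` when omitted, and `--minPhred` carries an integer default of 30. A self-contained sketch of just the added arguments:

```python
import argparse

# Sketch of the options added in this patch; omitted store_true flags
# come back False, and --minPhred falls back to its default of 30.
parser = argparse.ArgumentParser()
parser.add_argument('--use_g_x', action='store_true',
                    help='Consider G->deletion conversions?')
parser.add_argument('--use_ng_xg', action='store_true',
                    help='Consider NG->deletionG conversions?')
parser.add_argument('--minPhred', type=int, default=30,
                    help='Minimum phred quality score for a base. Default = 30')

args = parser.parse_args(['--use_g_x'])
print(args.use_g_x, args.use_ng_xg, args.minPhred)  # True False 30
```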
os.mkdir(args.outputDir) outputfile = os.path.join(args.outputDir, sample + '.pigpen.txt') - writeOutput(sampleparams, geneconvs, genecounts, geneid2genename, outputfile, args.use_g_t, args.use_g_c) + writeOutput(sampleparams, geneconvs, genecounts, geneid2genename, outputfile, args.use_g_t, args.use_g_c, args.use_g_x, args.use_ng_xg) print('Done!') #If there is a bed file of regions of interest supplied, then use that. Don't use the salmon/postmaster quantifications. @@ -180,15 +183,15 @@ sampleparams['samplebam'] = os.path.abspath(samplebam) if args.nproc == 1: if args.datatype == 'paired': - convs, readcounter = iteratereads_pairedend(samplebam, args.onlyConsiderOverlap, args.use_g_t, args.use_g_c, - args.use_read1, args.use_read2, args.nConv, args.minMappingQual, snps, maskpositions, 'high') + convs, readcounter = iteratereads_pairedend(samplebam, args.onlyConsiderOverlap, args.use_g_t, args.use_g_c, args.use_g_x, args.use_ng_xg, + args.use_read1, args.use_read2, args.nConv, args.minMappingQual, args.minPhred, snps, maskpositions, 'high') elif args.datatype == 'single': convs, readcounter = iterratereads_singleend( samplebam, args.use_g_t, args.use_g_c, args.nConv, args.minMappingQual, snps, maskpostions, 'high') elif args.nproc > 1: convs = getmismatches(args.datatype, samplebam, args.onlyConsiderOverlap, snps, maskpositions, - args.nConv, args.minMappingQual, args.nproc, args.use_g_t, args.use_g_c, args.use_read1, args.use_read2) + args.nConv, args.minMappingQual, args.nproc, args.use_g_t, args.use_g_c, args.use_g_x, args.use_ng_xg, args.use_read1, args.use_read2, args.minPhred) print('Assigning reads to genes in supplied bed file...') overlaps, numpairs = getReadOverlaps(samplebam, args.ROIbed, 'chrsort.txt') @@ -197,4 +200,4 @@ if not os.path.exists(args.outputDir): os.mkdir(args.outputDir) outputfile = os.path.join(args.outputDir, sample + '.pigpen.txt') - writeConvsPerGene(sampleparams, numreadspergene, convsPerGene, outputfile, args.use_g_t, 
args.use_g_c) \ No newline at end of file + writeConvsPerGene(sampleparams, numreadspergene, convsPerGene, outputfile, args.use_g_t, args.use_g_c, args.use_g_x, args.use_ng_xg) \ No newline at end of file From 9fad66f080356519e949f2f3ffcc2b1ed7a3b440 Mon Sep 17 00:00:00 2001 From: Matthew Taliaferro Date: Fri, 22 Mar 2024 09:33:34 -0600 Subject: [PATCH 067/108] update bacon for use with G deletions --- bacon_glm.py | 51 +++++++++++++++++++++++++-------------------------- 1 file changed, 25 insertions(+), 26 deletions(-) diff --git a/bacon_glm.py b/bacon_glm.py index 78de2d6..b4cce83 100644 --- a/bacon_glm.py +++ b/bacon_glm.py @@ -65,7 +65,7 @@ def makePORCdf(samp_conds_file, minreads, considernonG): else: genesinall = genesinall.intersection(set(dfgenes)) - columnstokeep = ['GeneID', 'GeneName', 'sample', 'numreads', 'G_Trate', 'G_Crate', 'convGrate', 'porc'] + columnstokeep = ['GeneID', 'GeneName', 'sample', 'numreads', 'G_Trate', 'G_Crate', 'G_Xrate', 'NG_XGrate', 'convGrate', 'porc'] df = df[columnstokeep] dfs.append(df) @@ -80,7 +80,7 @@ def makePORCdf(samp_conds_file, minreads, considernonG): #turn from long into wide df = df.pivot_table(index=['GeneID', 'GeneName'], columns='sample', values=[ - 'numreads', 'G_Trate', 'G_Crate', 'convGrate', 'porc']).reset_index() + 'numreads', 'G_Trate', 'G_Crate', 'G_Xrate', 'NG_XGrate', 'convGrate', 'porc']).reset_index() #flatten multiindex column names df.columns = ["_".join(a) if '' not in a else a[0] for a in df.columns.to_flat_index()] @@ -149,10 +149,6 @@ def calcDeltaPORC(porcdf, sampconds, conditionA, conditionB, metric): if metric == 'porc': porcdf = porcdf.assign(delta_porc = deltametrics) - elif metric == 'G_Trate': - porcdf = porcdf.assign(delta_G_Trate = deltametrics) - elif metric == 'G_Crate': - porcdf = porcdf.assign(delta_G_Crate = deltametrics) elif metric == 'convGrate': porcdf = porcdf.assign(delta_convGrate = deltametrics) @@ -162,18 +158,24 @@ def calcDeltaPORC(porcdf, sampconds, conditionA, 
conditionB, metric): return porcdf -def makeContingencyTable(row, use_g_t, use_g_c): +def makeContingencyTable(row, use_g_t, use_g_c, use_g_x, use_ng_xg): #Given a row from a pigpen df, return a contingency table of the form #[[convG, nonconvG], [convnonG, nonconvnonG]] - if use_g_t and use_g_c: - convG = row['g_t'] + row['g_c'] - elif use_g_t and not use_g_c: - convG = row['g_t'] - elif use_g_c and not use_g_t: - convG = row['g_c'] + #Identity of converted Gs depends on which ones you want to count + possibleconvs = ['g_t', 'g_c', 'g_x', 'ng_xg'] + possibleoptions = [use_g_t, use_g_c, use_g_x, use_ng_xg] + selectedconvs = [] + for ind, option in enumerate(possibleoptions): + if option == True: + selectedconvs.append(possibleconvs[ind]) + + convG = 0 + for conv in selectedconvs: + convG += row[conv] + nonconvG = row['g_g'] - convnonG = row['a_t'] + row['a_c'] + row['a_g'] + row['c_a'] + row['c_t'] + row['c_g'] + row['t_a'] + row['t_c'] + row['t_g'] + convnonG = row['a_t'] + row['a_c'] + row['a_g'] + row['a_x'] + row['c_a'] + row['c_t'] + row['c_g'] + row['c_x'] + row['t_a'] + row['t_c'] + row['t_g'] + row['t_x'] nonconvnonG = row['c_c'] + row['t_t'] + row['a_a'] conttable = [[convG, nonconvG], [convnonG, nonconvnonG]] @@ -293,7 +295,7 @@ def multihyp(pvalues): return correctedps -def getpvalues(samp_conds_file, conditionA, conditionB, considernonG, filteredgenes, use_g_t, use_g_c): +def getpvalues(samp_conds_file, conditionA, conditionB, considernonG, filteredgenes, use_g_t, use_g_c, use_g_x, use_ng_xg): #each contingency table will be: [[convG, nonconvG], [convnonG, nonconvnonG]] #These will be stored in a dictionary: {gene : [condAtables, condBtables]} conttables = {} @@ -312,7 +314,7 @@ def getpvalues(samp_conds_file, conditionA, conditionB, considernonG, filteredge condition = line[2] df = pd.read_csv(pigpenfile, sep = '\t', index_col = False, header=0, comment = '#') for idx, row in df.iterrows(): - conttable = makeContingencyTable(row, use_g_t, use_g_c) + 
conttable = makeContingencyTable(row, use_g_t, use_g_c, use_g_x, use_ng_xg) gene = row['GeneID'] #If this isn't one of the genes that passes read count filters in all files, skip it if gene not in filteredgenes: @@ -377,8 +379,10 @@ def formatporcdf(porcdf): help='One of the two conditions in the \'condition\' column of sampconds. Deltaporc is defined as conditionB - conditionA.') parser.add_argument('--conditionB', type=str, help='One of the two conditions in the \'condition\' column of sampconds. Deltaporc is defined as conditionB - conditionA.') - parser.add_argument('--use_g_t', help = 'Consider G to T mutations when calculating G conversion rate?', action = 'store_true') - parser.add_argument('--use_g_c', help = 'Consider G to C mutations when calculating G conversion rate?', action = 'store_true') + parser.add_argument('--use_g_t', help = 'Consider G to T mutations in contingency table?', action = 'store_true') + parser.add_argument('--use_g_c', help = 'Consider G to C mutations in contingency table?', action = 'store_true') + parser.add_argument('--use_g_x', help = 'Consider G to deletion mutations in contingency table?', action = 'store_true') + parser.add_argument('--use_ng_xg', help = 'Consider NG to deletionG mutations in contingency table?', action = 'store_true') parser.add_argument('--considernonG', help='Consider conversions of nonG residues to normalize for overall mutation rate?', action = 'store_true') parser.add_argument('--output', type = str, help = 'Output file.') @@ -392,13 +396,8 @@ def formatporcdf(porcdf): #What metric should we care about? 
if args.considernonG: metric = 'porc' - elif args.use_g_t and not args.use_g_c: - metric = 'G_Trate' - elif args.use_g_c and not args.use_g_t: - metric = 'G_Crate' - elif args.use_g_t and args.use_g_c: - metric = 'convGrate' - + else: + metric = 'convGrate' #what goes into this rate is set in the corresponding pigpen run #Make df of PORC values porcdf = makePORCdf(args.sampconds, args.minreads, args.considernonG) @@ -406,7 +405,7 @@ def formatporcdf(porcdf): porcdf = calcDeltaPORC(porcdf, args.sampconds, args.conditionA, args.conditionB, metric) filteredgenes = porcdf['GeneID'].tolist() #Get p values and corrected p values - pdf = getpvalues(args.sampconds, args.conditionA, args.conditionB, args.considernonG, filteredgenes, args.use_g_t, args.use_g_c) + pdf = getpvalues(args.sampconds, args.conditionA, args.conditionB, args.considernonG, filteredgenes, args.use_g_t, args.use_g_c, args.use_g_x, args.use_ng_xg) #add p values and FDR porcdf = pd.merge(porcdf, pdf, on = ['GeneID']) #Format floats From 4c1e571913cf6ef6aad19fa8d33d99b3c354fa7b Mon Sep 17 00:00:00 2001 From: Matthew Taliaferro Date: Fri, 22 Mar 2024 09:35:38 -0600 Subject: [PATCH 068/108] fix typo in pigpen.py --- pigpen.py | 9 +++++---- 1 file changed, 5 insertions(+), 4 deletions(-) diff --git a/pigpen.py b/pigpen.py index ebcff5f..215c774 100644 --- a/pigpen.py +++ b/pigpen.py @@ -8,8 +8,8 @@ from snps import getSNPs, recordSNPs from maskpositions import readmaskbed from getmismatches import iteratereads_pairedend, getmismatches -from assignreads_salmon import getpostmasterassignments, assigntotxs, collapsetogene, readspergene, writeOutput -#from assignreads_salmon_ensembl import getpostmasterassignments, assigntotxs, collapsetogene, readspergene, writeOutput +#from assignreads_salmon import getpostmasterassignments, assigntotxs, collapsetogene, readspergene, writeOutput +from assignreads_salmon_ensembl import getpostmasterassignments, assigntotxs, collapsetogene, readspergene, writeOutput from 
assignreads import getReadOverlaps, processOverlaps from conversionsPerGene import getPerGene, writeConvsPerGene @@ -57,6 +57,7 @@ samplenames = args.samplenames.split(',') salmonquants = [os.path.join(x, 'salmon', '{0}.quant.sf'.format(x)) for x in samplenames] starbams = [os.path.join(x, 'STAR', '{0}Aligned.sortedByCoord.out.bam'.format(x)) for x in samplenames] #non-deduplicated bams + #starbams = [os.path.join(x, 'STAR', '{0}.dedup.bam'.format(x)) for x in samplenames] postmasterbams = [os.path.join(x, 'postmaster', '{0}.postmaster.bam'.format(x)) for x in samplenames] #Take in list of control samples, make list of their corresponding star bams for SNP calling @@ -136,7 +137,7 @@ args.use_read1, args.use_read2, args.nConv, args.minMappingQual, args.minPhred, snps, maskpositions, 'high') elif args.datatype == 'single': convs, readcounter = iterratereads_singleend( - samplebam, args.use_g_t, args.use_g_c, args.nConv, args.minMappingQual, snps, maskpostions, 'high') + samplebam, args.use_g_t, args.use_g_c, args.nConv, args.minMappingQual, snps, maskpositions, 'high') elif args.nproc > 1: convs = getmismatches(args.datatype, samplebam, args.onlyConsiderOverlap, snps, maskpositions, args.nConv, args.minMappingQual, args.nproc, args.use_g_t, args.use_g_c, args.use_g_x, args.use_ng_xg, args.use_read1, args.use_read2, args.minPhred) @@ -187,7 +188,7 @@ args.use_read1, args.use_read2, args.nConv, args.minMappingQual, args.minPhred, snps, maskpositions, 'high') elif args.datatype == 'single': convs, readcounter = iterratereads_singleend( - samplebam, args.use_g_t, args.use_g_c, args.nConv, args.minMappingQual, snps, maskpostions, 'high') + samplebam, args.use_g_t, args.use_g_c, args.nConv, args.minMappingQual, snps, maskpositions, 'high') elif args.nproc > 1: convs = getmismatches(args.datatype, samplebam, args.onlyConsiderOverlap, snps, maskpositions, From 35e50f7013625534c9ded74cf4200f5d7291d465 Mon Sep 17 00:00:00 2001 From: Matthew Taliaferro Date: Fri, 22 Mar 2024 
09:36:07 -0600 Subject: [PATCH 069/108] add try/except for deletions at end of read --- getmismatches.py | 50 ++++++++++++++++++++++++++++++++++++------------ 1 file changed, 38 insertions(+), 12 deletions(-) diff --git a/getmismatches.py b/getmismatches.py index 707dd38..74ac132 100644 --- a/getmismatches.py +++ b/getmismatches.py @@ -461,16 +461,26 @@ def getmismatches_pairedend(read1alignedpairs, read2alignedpairs, read1queryseq, #Check if there is a reference G downstream of this position if read1strand == '+': downstreamrefpos = refpos + 1 - downstreamrefseq = mergedalignedpairs[downstreamrefpos][2].upper() - downstreamqueryseq = mergedalignedpairs[downstreamrefpos][4].upper() + #It's possible that downstreamrefpos is not in mergedalignedpairs because this position is at the end of the read + try: + downstreamrefseq = mergedalignedpairs[downstreamrefpos][2].upper() + downstreamqueryseq = mergedalignedpairs[downstreamrefpos][4].upper() + except KeyError: + downstreamrefseq, downstreamqueryseq = None, None elif read1strand == '-': downstreamrefpos = refpos - 1 - downstreamrefseq = mergedalignedpairs[downstreamrefpos][2].upper() - downstreamqueryseq = mergedalignedpairs[downstreamrefpos][4].upper() + try: + downstreamrefseq = mergedalignedpairs[downstreamrefpos][2].upper() + downstreamqueryseq = mergedalignedpairs[downstreamrefpos][4].upper() + except KeyError: + downstreamrefseq, downstreamqueryseq = None, None downstreamrefseq = revcomp(downstreamrefseq) downstreamqueryseq = revcomp(downstreamqueryseq) if downstreamrefseq == 'G': conv2 = 'ng_xg' + #If this is a non-g deletion and is downstream of a g, we can't be sure if this deletion is due to this nucleotide or the downstream g + if conv in ['a_x', 't_x', 'c_x']: + conv = None #Add conv(s) to dictionary if r1quality >= minPhred and onlyoverlap == False: @@ -498,16 +508,24 @@ def getmismatches_pairedend(read1alignedpairs, read2alignedpairs, read1queryseq, #Check if there is a reference G downstream of this 
position if read1strand == '+': downstreamrefpos = refpos + 1 - downstreamrefseq = mergedalignedpairs[downstreamrefpos][2].upper() - downstreamqueryseq = mergedalignedpairs[downstreamrefpos][4].upper() + try: + downstreamrefseq = mergedalignedpairs[downstreamrefpos][2].upper() + downstreamqueryseq = mergedalignedpairs[downstreamrefpos][4].upper() + except KeyError: + downstreamrefseq, downstreamqueryseq = None, None elif read1strand == '-': downstreamrefpos = refpos - 1 - downstreamrefseq = mergedalignedpairs[downstreamrefpos][2].upper() - downstreamqueryseq = mergedalignedpairs[downstreamrefpos][4].upper() + try: + downstreamrefseq = mergedalignedpairs[downstreamrefpos][2].upper() + downstreamqueryseq = mergedalignedpairs[downstreamrefpos][4].upper() + except KeyError: + downstreamrefseq, downstreamqueryseq = None, None downstreamrefseq = revcomp(downstreamrefseq) downstreamqueryseq = revcomp(downstreamqueryseq) if downstreamrefseq == 'G': conv2 = 'ng_xg' + if conv in ['a_x', 't_x', 'c_x']: + conv = None #Add conv(s) to dictionary if r2quality >= minPhred and onlyoverlap == False: @@ -536,16 +554,24 @@ def getmismatches_pairedend(read1alignedpairs, read2alignedpairs, read1queryseq, #Check if there is a reference G downstream of this position if read1strand == '+': downstreamrefpos = refpos + 1 - downstreamrefseq = mergedalignedpairs[downstreamrefpos][2].upper() - downstreamqueryseq = mergedalignedpairs[downstreamrefpos][4].upper() + try: + downstreamrefseq = mergedalignedpairs[downstreamrefpos][2].upper() + downstreamqueryseq = mergedalignedpairs[downstreamrefpos][4].upper() + except KeyError: + downstreamrefseq, downstreamqueryseq = None, None elif read1strand == '-': downstreamrefpos = refpos - 1 - downstreamrefseq = mergedalignedpairs[downstreamrefpos][2].upper() - downstreamqueryseq = mergedalignedpairs[downstreamrefpos][4].upper() + try: + downstreamrefseq = mergedalignedpairs[downstreamrefpos][2].upper() + downstreamqueryseq = 
mergedalignedpairs[downstreamrefpos][4].upper() + except KeyError: + downstreamrefseq, downstreamqueryseq = None, None downstreamrefseq = revcomp(downstreamrefseq) downstreamqueryseq = revcomp(downstreamqueryseq) if downstreamrefseq == 'G': conv2 = 'ng_xg' + if conv in ['a_x', 't_x', 'c_x']: + conv = None #Add conv(s) to dictionary #Only do conv2 (ng_xg) if conv is not g_x From 4c8790ba347c13d40262ba1d60a8389053855bb0 Mon Sep 17 00:00:00 2001 From: Matthew Taliaferro Date: Mon, 3 Feb 2025 10:42:44 -0700 Subject: [PATCH 070/108] add mismatch code for MPRA data --- getmismatches_MPRA.py | 460 ++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 460 insertions(+) create mode 100644 getmismatches_MPRA.py diff --git a/getmismatches_MPRA.py b/getmismatches_MPRA.py new file mode 100644 index 0000000..51be1bc --- /dev/null +++ b/getmismatches_MPRA.py @@ -0,0 +1,460 @@ +import pysam +import os +import re +import sys +from collections import defaultdict +import numpy as np +import pandas as pd +import subprocess +import argparse + + +def revcomp(nt): + revcompdict = { + 'G' : 'C', + 'C' : 'G', + 'A' : 'T', + 'T' : 'A', + 'N' : 'N', + 'g' : 'c', + 'c' : 'g', + 'a' : 't', + 't' : 'a', + 'n' : 'n', + 'X': 'X', + None: None, + 'NA': 'NA' + } + + nt_rc = revcompdict[nt] + + return nt_rc + + +def read_pair_generator(bam, region_string=None): + """ + Generate read pairs in a BAM file or within a region string. + Reads are added to read_dict until a pair is found. 
+ https://www.biostars.org/p/306041/ + """ + read_dict = defaultdict(lambda: [None, None]) + for read in bam: + if not read.is_proper_pair or read.is_secondary or read.is_supplementary or read.mate_is_unmapped or read.is_unmapped: + continue + qname = read.query_name + if qname not in read_dict: + if read.is_read1: + read_dict[qname][0] = read + else: + read_dict[qname][1] = read + else: + if read.is_read1: + yield read, read_dict[qname][1] + else: + yield read_dict[qname][0], read + del read_dict[qname] + + +def iteratereads_pairedend(bam, onlyConsiderOverlap, use_read1, use_read2, nConv, minMappingQual, minPhred=30): + #Iterate over reads in a paired end alignment file. + #Find nt conversion locations for each read. + #For locations interrogated by both mates of read pair, conversion must exist in both mates in order to count + #Store the number of conversions for each read in a dictionary + + #Quality score array is always in the same order as query_sequence, which is always on the + strand + #Bam must contain MD tags + + if onlyConsiderOverlap == 'True': + onlyConsiderOverlap = True + elif onlyConsiderOverlap == 'False': + onlyConsiderOverlap = False + + queriednts = [] + readcounter = 0 + convs = {} # {oligoname : [readcount, {dictionary of all conversions}] + save = pysam.set_verbosity(0) + with pysam.AlignmentFile(bam, 'r') as infh: + print('Finding nucleotide conversions in {0}...'.format( + os.path.basename(bam))) + for read1, read2 in read_pair_generator(infh): + + #Just double check that the pairs are matched + if read1.query_name != read2.query_name: + continue + + if read1.reference_name != read2.reference_name: + continue + + #Check mapping quality + #MapQ is 255 for uniquely aligned reads FOR STAR ONLY + #MapQ is (I think) >= 2 for uniquely aligned reads from bowtie2 + if read1.mapping_quality < minMappingQual or read2.mapping_quality < minMappingQual: + continue + + readcounter += 1 + if readcounter % 1000000 == 0: + print('Finding nucleotide 
conversions in read {0}...'.format(
+                    readcounter))
+
+            oligo = read1.reference_name
+            read1queryseq = read1.query_sequence
+            read1alignedpairs = read1.get_aligned_pairs(with_seq=True)
+            if read1.is_reverse:
+                read1strand = '-'
+            elif not read1.is_reverse:
+                read1strand = '+'
+
+            read2queryseq = read2.query_sequence
+            read2alignedpairs = read2.get_aligned_pairs(with_seq=True)
+            if read2.is_reverse:
+                read2strand = '-'
+            elif not read2.is_reverse:
+                read2strand = '+'
+
+            read1qualities = list(read1.query_qualities) # phred scores
+            read2qualities = list(read2.query_qualities)
+
+            #Pass None for masklocations: there are no SNP/mask positions to exclude for MPRA oligos
+            convs_in_read = getmismatches_pairedend(read1alignedpairs, read2alignedpairs, read1queryseq, read2queryseq, read1qualities, read2qualities, read1strand, read2strand, None, onlyConsiderOverlap, nConv, minPhred, use_read1, use_read2)
+            queriednts.append(sum(convs_in_read.values()))
+
+            #Add to our running dictionary
+            if oligo not in convs:
+                convs[oligo] = [1, convs_in_read]
+            elif oligo in convs:
+                readcount = convs[oligo][0]
+                readcount +=1
+                oligodict = convs[oligo][1]
+                for conv in convs_in_read:
+                    convcount = convs_in_read[conv]
+                    oligodict[conv] += convcount
+                convs[oligo] = [readcount, oligodict]
+
+    return convs
+
+
+def getmismatches_pairedend(read1alignedpairs, read2alignedpairs, read1queryseq, read2queryseq, read1qualities, read2qualities, read1strand, read2strand, masklocations, onlyoverlap, nConv, minPhred, use_read1, use_read2):
+    #remove tuples that have None
+    #These are either intronic or might have been soft-clipped
+    #Tuples are (querypos, refpos, refsequence)
+    #If there is a substitution, refsequence is lower case
+
+    #For now, forget insertions. Get rid of any position where reference position is None.
+ read1alignedpairs = [x for x in read1alignedpairs if x[1] != None] + read2alignedpairs = [x for x in read2alignedpairs if x[1] != None] + read1alignedpairs = [x for x in read1alignedpairs if x[2] != None] + read2alignedpairs = [x for x in read2alignedpairs if x[2] != None] + + #Add quality scores and query sequences + #will now be (querypos, refpos, refsequence, querysequence, qualscore) + for x in range(len(read1alignedpairs)): + alignedpair = read1alignedpairs[x] + querypos = alignedpair[0] + if querypos != None: + querynt = read1queryseq[querypos] + qualscore = read1qualities[querypos] + elif querypos == None: + querynt = 'X' # there is no query nt for a deletion + qualscore = 37 # there's no query position here, so make up a quality score + alignedpair = alignedpair + (querynt, qualscore) + read1alignedpairs[x] = alignedpair + + for x in range(len(read2alignedpairs)): + alignedpair = read2alignedpairs[x] + querypos = alignedpair[0] + if querypos != None: + querynt = read2queryseq[querypos] + qualscore = read2qualities[querypos] + elif querypos == None: + querynt = 'X' + qualscore = 37 + alignedpair = alignedpair + (querynt, qualscore) + read2alignedpairs[x] = alignedpair + + #if we have locations to mask, remove their locations from read1alignedpairs and read2alignedpairs + #masklocations is a set of 0-based coordinates of snp locations to mask + if masklocations: + read1alignedpairs = [x for x in read1alignedpairs if x[1] not in masklocations] + read2alignedpairs = [x for x in read2alignedpairs if x[1] not in masklocations] + + convs = {} #counts of conversions x_y where x is reference sequence and y is query sequence + + possibleconvs = [ + 'a_a', 'a_t', 'a_c', 'a_g', 'a_n', + 'g_a', 'g_t', 'g_c', 'g_g', 'g_n', + 'c_a', 'c_t', 'c_c', 'c_g', 'c_n', + 't_a', 't_t', 't_c', 't_g', 't_n', + 'a_x', 'g_x', 'c_x', 't_x', 'ng_xg'] + + #initialize dictionary + for conv in possibleconvs: + convs[conv] = 0 + + #For locations interrogated by both mates of read pair, 
conversion must exist in both mates in order to count + #These locations (as defined by their reference positions) would be found both in read1alignedpairs and read2alignedpairs + + #Get the ref positions queried by the two reads + r1dict = {} #{reference position: [queryposition, reference sequence, querysequence, quality]} + r2dict = {} + for x in read1alignedpairs: + r1dict[int(x[1])] = [x[0], x[2].upper(), x[3].upper(), x[4]] + for x in read2alignedpairs: + r2dict[int(x[1])] = [x[0], x[2].upper(), x[3].upper(), x[4]] + + # {refpos : [R1querypos, R2querypos, R1refsequence, R2refsequence, R1querysequence, R2querysequence, R1quality, R2quality]} + mergedalignedpairs = {} + #For positions only in R1 or R2, querypos and refsequence are NA for the other read + for refpos in r1dict: + r1querypos = r1dict[refpos][0] + r1refseq = r1dict[refpos][1] + r1queryseq = r1dict[refpos][2] + r1quality = r1dict[refpos][3] + if refpos in mergedalignedpairs: #this should not be possible because we are looking at r1 first + r2querypos = mergedalignedpairs[refpos][1] + r2refseq = mergedalignedpairs[refpos][3] + r2queryseq = mergedalignedpairs[refpos][5] + r2quality = mergedalignedpairs[refpos][7] + mergedalignedpairs[refpos] = [r1querypos, r2querypos, r1refseq, + r2refseq, r1queryseq, r2queryseq, r1quality, r2quality] + else: + mergedalignedpairs[refpos] = [r1querypos, 'NA', r1refseq, 'NA', r1queryseq, 'NA', r1quality, 'NA'] + + for refpos in r2dict: + #same thing + r2querypos = r2dict[refpos][0] + r2refseq = r2dict[refpos][1] + r2queryseq = r2dict[refpos][2] + r2quality = r2dict[refpos][3] + if refpos in mergedalignedpairs: #if we saw it for r1 + r1querypos = mergedalignedpairs[refpos][0] + r1refseq = mergedalignedpairs[refpos][2] + r1queryseq = mergedalignedpairs[refpos][4] + r1quality = mergedalignedpairs[refpos][6] + mergedalignedpairs[refpos] = [r1querypos, r2querypos, r1refseq, + r2refseq, r1queryseq, r2queryseq, r1quality, r2quality] + else: + mergedalignedpairs[refpos] = 
['NA', r2querypos, 'NA', r2refseq, 'NA', r2queryseq, 'NA', r2quality] + + #If we are only using read1 or only using read2, replace the positions in the non-used read with NA + for refpos in mergedalignedpairs: + r1querypos, r2querypos, r1refseq, r2refseq, r1queryseq, r2queryseq, r1quality, r2quality = mergedalignedpairs[refpos] + if use_read1 and not use_read2: + updatedlist = [r1querypos, 'NA', r1refseq, + 'NA', r1queryseq, 'NA', r1quality, 'NA'] + mergedalignedpairs[refpos] = updatedlist + elif use_read2 and not use_read1: + updatedlist = ['NA', r2querypos, 'NA', + r2refseq, 'NA', r2queryseq, 'NA', r2quality] + mergedalignedpairs[refpos] = updatedlist + elif not use_read1 and not use_read2: + print('ERROR: we have to use either read1 or read2, if not both.') + sys.exit() + elif use_read1 and use_read2: + pass + + #Now go through mergedalignedpairs, looking for conversions. + #For positions observed both in r1 and r2, queryseq in both reads must match, otherwise the position is skipped. + #We are now keeping track of deletions as either g_x (reference G, query deletion) or ng_xg (ref nt 5' of G deleted in query) + #We have observed that sometimes RT skips the nucleotide *after* an oxidized G (after being from the RT's point of view) + + for refpos in mergedalignedpairs: + conv = None + conv2 = None #sometimes we can have 2 convs (for example the first nt of ng_xg could be both g_x and ng_xg) + r1querypos = mergedalignedpairs[refpos][0] + r2querypos = mergedalignedpairs[refpos][1] + r1refseq = mergedalignedpairs[refpos][2] + r2refseq = mergedalignedpairs[refpos][3] + r1queryseq = mergedalignedpairs[refpos][4] + r2queryseq = mergedalignedpairs[refpos][5] + r1quality = mergedalignedpairs[refpos][6] + r2quality = mergedalignedpairs[refpos][7] + + if r1querypos != 'NA' and r2querypos == 'NA': #this position queried by r1 only + if read1strand == '-': + #refseq needs to equal the sense strand (it is always initially defined as the + strand). 
read1 is always the sense strand. + r1refseq = revcomp(r1refseq) + r1queryseq = revcomp(r1queryseq) + + #If reference is N, skip this position + if r1refseq == 'N' or r1refseq == 'n': + continue + conv = r1refseq.lower() + '_' + r1queryseq.lower() + + if r1queryseq == 'X': + #Check if there is a reference G downstream of this position + if read1strand == '+': + downstreamrefpos = refpos + 1 + #It's possible that downstreamrefpos is not in mergedalignedpairs because this position is at the end of the read + try: + downstreamrefseq = mergedalignedpairs[downstreamrefpos][2].upper() + downstreamqueryseq = mergedalignedpairs[downstreamrefpos][4].upper() + except KeyError: + downstreamrefseq, downstreamqueryseq = None, None + elif read1strand == '-': + downstreamrefpos = refpos - 1 + try: + downstreamrefseq = mergedalignedpairs[downstreamrefpos][2].upper() + downstreamqueryseq = mergedalignedpairs[downstreamrefpos][4].upper() + except KeyError: + downstreamrefseq, downstreamqueryseq = None, None + downstreamrefseq = revcomp(downstreamrefseq) + downstreamqueryseq = revcomp(downstreamqueryseq) + if downstreamrefseq == 'G': + conv2 = 'ng_xg' + #If this is a non-g deletion and is downstream of a g, we can't be sure if this deletion is due to this nucleotide or the downstream g + if conv in ['a_x', 't_x', 'c_x']: + conv = None + + #Add conv(s) to dictionary + if r1quality >= minPhred and onlyoverlap == False: + # there will be some conversions (e.g. a_x that are not in convs) + if conv in convs: + convs[conv] +=1 + if conv2 == 'ng_xg' and conv != 'g_x': + convs[conv2] +=1 + + elif r1querypos == 'NA' and r2querypos != 'NA': #this position is queried by r2 only + if read1strand == '-': + # reference seq is independent of which read we are talking about + #Read1 is always the sense strand. r1queryseq and r2queryseq are always + strand + #The reference sequence is always on the + strand. 
+            #If read1 is on the - strand, we have already flipped reference seq (see a few lines above).
+            #We need to flip read2queryseq so that it is also - strand.
+                r2refseq = revcomp(r2refseq)
+                r2queryseq = revcomp(r2queryseq)
+
+            if r2refseq == 'N' or r2refseq == 'n':
+                continue
+
+            conv = r2refseq.lower() + '_' + r2queryseq.lower()
+            if r2queryseq == 'X':
+                #Check if there is a reference G downstream of this position
+                if read1strand == '+':
+                    downstreamrefpos = refpos + 1
+                    try:
+                        downstreamrefseq = mergedalignedpairs[downstreamrefpos][2].upper()
+                        downstreamqueryseq = mergedalignedpairs[downstreamrefpos][4].upper()
+                    except KeyError:
+                        downstreamrefseq, downstreamqueryseq = None, None
+                elif read1strand == '-':
+                    downstreamrefpos = refpos - 1
+                    try:
+                        downstreamrefseq = mergedalignedpairs[downstreamrefpos][2].upper()
+                        downstreamqueryseq = mergedalignedpairs[downstreamrefpos][4].upper()
+                    except KeyError:
+                        downstreamrefseq, downstreamqueryseq = None, None
+                    downstreamrefseq = revcomp(downstreamrefseq)
+                    downstreamqueryseq = revcomp(downstreamqueryseq)
+                if downstreamrefseq == 'G':
+                    conv2 = 'ng_xg'
+                if conv in ['a_x', 't_x', 'c_x']:
+                    conv = None
+
+            #Add conv(s) to dictionary
+            if r2quality >= minPhred and onlyoverlap == False:
+                if conv in convs:
+                    convs[conv] +=1
+                if conv2 == 'ng_xg' and conv != 'g_x':
+                    convs[conv2] +=1
+
+        elif r1querypos != 'NA' and r2querypos != 'NA': #this position is queried by both reads
+            if read1strand == '-':
+                r1refseq = revcomp(r1refseq)
+                r2refseq = revcomp(r2refseq)
+                r1queryseq = revcomp(r1queryseq)
+                r2queryseq = revcomp(r2queryseq)
+
+            if r1refseq == 'N' or r2refseq == 'N' or r1refseq == 'n' or r2refseq == 'n':
+                continue
+
+            r1result = r1refseq.lower() + '_' + r1queryseq.lower()
+            r2result = r2refseq.lower() + '_' + r2queryseq.lower()
+
+            #Only record if r1 and r2 agree about what is going on
+            if r1result == r2result:
+                conv = r1refseq.lower() + '_' + r1queryseq.lower()
+                if r1queryseq == 'X':
+                    #Check if there is a reference
G downstream of this position + if read1strand == '+': + downstreamrefpos = refpos + 1 + try: + downstreamrefseq = mergedalignedpairs[downstreamrefpos][2].upper() + downstreamqueryseq = mergedalignedpairs[downstreamrefpos][4].upper() + except KeyError: + downstreamrefseq, downstreamqueryseq = None, None + elif read1strand == '-': + downstreamrefpos = refpos - 1 + try: + downstreamrefseq = mergedalignedpairs[downstreamrefpos][2].upper() + downstreamqueryseq = mergedalignedpairs[downstreamrefpos][4].upper() + except KeyError: + downstreamrefseq, downstreamqueryseq = None, None + downstreamrefseq = revcomp(downstreamrefseq) + downstreamqueryseq = revcomp(downstreamqueryseq) + if downstreamrefseq == 'G': + conv2 = 'ng_xg' + if conv in ['a_x', 't_x', 'c_x']: + conv = None + + #Add conv(s) to dictionary + #Only do conv2 (ng_xg) if conv is not g_x + if r1quality >= minPhred and r2quality >= minPhred: + if conv in convs: + convs[conv] +=1 + if conv2 == 'ng_xg' and conv != 'g_x': + convs[conv2] +=1 + + elif r1querypos == 'NA' and r2querypos == 'NA': #if we are using only read1 or read2, it's possible for this position in both reads to be NA + continue + + #Does the number of t_c conversions meet our threshold? 
+ if convs['t_c'] >= nConv: + pass + elif convs['t_c'] < nConv: + convs['t_c'] = 0 + + return convs + +def writeOutput(convs, outfile): + #Write conv dict in text output table + # + possibleconvs = [ + 'a_a', 'a_t', 'a_c', 'a_g', 'a_n', + 'g_a', 'g_t', 'g_c', 'g_g', 'g_n', + 'c_a', 'c_t', 'c_c', 'c_g', 'c_n', + 't_a', 't_t', 't_c', 't_g', 't_n'] + + headerlist = ['oligo', 'readcount'] + possibleconvs + with open(outfile, 'w') as outfh: + outfh.write(('\t').join(headerlist) + '\n') + for oligo in convs: + readcount = convs[oligo][0] + outfh.write(oligo + '\t' + str(readcount) + '\t') + for possibleconv in possibleconvs: + v = str(convs[oligo][1][possibleconv]) + if possibleconv != 't_n': + outfh.write(v + '\t') + elif possibleconv == 't_n': + outfh.write(v + '\n') + +if __name__ == '__main__': + parser = argparse.ArgumentParser(description = 'Count mismatches in MPRA sequencing data.') + parser.add_argument('--bam', type = str, help = 'Alignment file. Ideally from bowtie2.') + parser.add_argument('--onlyConsiderOverlap', action='store_true', + help='Only consider conversions seen in both reads of a read pair? Only possible with paired end data.') + parser.add_argument('--use_read1', action='store_true', + help='Use read1 when looking for conversions? Only useful with paired end data.') + parser.add_argument('--use_read2', action='store_true', + help='Use read2 when looking for conversions? Only useful with paired end data.') + parser.add_argument('--minMappingQual', type=int, + help='Minimum mapping quality for a read to be considered in conversion counting. bowtie2 unique mappers have MAPQ >=2.', required=True) + parser.add_argument('--minPhred', type = int, help = 'Minimum phred score for a nucleotide to be considered.') + parser.add_argument( + '--nConv', type=int, help='Minimum number of required T->C conversions in a read pair in order for those conversions to be counted. 
Default is 1.', default=1) + parser.add_argument('--output', type = str, help = 'Output file.') + args = parser.parse_args() + + convs = iteratereads_pairedend(args.bam, args.onlyConsiderOverlap, args.use_read1, args.use_read2, args.nConv, args.minMappingQual, args.minPhred) + writeOutput(convs, args.output) + + + From aa8e42c754135a39d0dcfba4208ce867d41b1dfb Mon Sep 17 00:00:00 2001 From: Matthew Taliaferro Date: Fri, 14 Feb 2025 18:02:00 -0700 Subject: [PATCH 071/108] add maxmap to alignandquant --- alignAndQuant.py | 3 +- assignreads_salmon_ensembl.py | 293 ++++++++++++++++++++++++++++++++++ 2 files changed, 295 insertions(+), 1 deletion(-) create mode 100644 assignreads_salmon_ensembl.py diff --git a/alignAndQuant.py b/alignAndQuant.py index 5636dd6..a1cedf8 100644 --- a/alignAndQuant.py +++ b/alignAndQuant.py @@ -228,7 +228,8 @@ def addMD(samplename, reffasta, nthreads): os.chdir(sampledir) runSTAR(r1, r2, nthreads, STARindex, samplename) - filterbam(samplename, maxmap) + if maxmap: + filterbam(samplename, maxmap) #aligned read files alignedr1 = samplename + '.aligned.r1.fq.gz' diff --git a/assignreads_salmon_ensembl.py b/assignreads_salmon_ensembl.py new file mode 100644 index 0000000..26cbe00 --- /dev/null +++ b/assignreads_salmon_ensembl.py @@ -0,0 +1,293 @@ +import os +import sys +import pysam +import pickle +import gffutils +import numpy as np + + +#Take in a dictionary of {readid : conversions} (made by getmismatches.py) and a postmaster-enhanced bam (made by alignAndQuant.py). +#First, construct dictionary of {readid : {txid : fractional assignment}}. Then, combining this dictionary with the previous one, +#count the number of conversions associated with each transcript. Finally (and I guess optionally), using a genome annotation file, +#collapse transcript level conversion counts to gene-level conversion counts. 
+
+def getpostmasterassignments(postmasterbam):
+    #Given a postmaster-produced bam, make a dictionary of the form {readid : {txid : fractional assignment}}
+    #In a postmaster bam, the two reads of a pair appear consecutively and always carry
+    #the same fractional assignments, so we only need to consider R1 reads.
+
+    pprobs = {} #{readid : {txid : pprob}}
+
+    with pysam.AlignmentFile(postmasterbam, 'r') as bamfh:
+        for read in bamfh.fetch(until_eof = True):
+            if read.is_read2:
+                continue
+            readid = read.query_name
+            tx = read.reference_name.split('.')[0]
+            pprob = read.get_tag(tag='ZW')
+            if readid not in pprobs:
+                pprobs[readid] = {}
+            pprobs[readid][tx] = pprob
+
+    return pprobs
+
+def assigntotxs(pprobs, convs):
+    #Intersect posterior probabilities of read assignments to transcripts with conversion counts of those reads.
+    #The counts assigned to a tx by a read are scaled by the posterior probability that the read came from that transcript.
+
+    #pprobs = {readid : {txid : pprob}}
+    #produced from getpostmasterassignments()
+    #convs = {readid : {a_a : 200, a_t : 1, etc.}}
+    print('Finding transcript assignments for {0} reads.'.format(len(convs)))
+    readswithoutconvs = 0 #number of reads in pprobs that have no entry in convs (i.e. salmon assigned them to a transcript, but no conversion info was recorded for them)
+    assignedreads = 0 #number of reads in convs for which we found a match in pprobs
+
+    txconvs = {} # {txid : {a_a : 200, a_t : 1, etc.}}
+
+    for readid in pprobs:
+        try:
+            readconvs = convs[readid]
+            assignedreads += 1
+        except KeyError: #we couldn't find this read in convs
+            readswithoutconvs += 1
+            continue
+
+        for txid in pprobs[readid]:
+            txid = txid.split('.')[0]
+            if txid not in txconvs:
+                txconvs[txid] = {}
+            pprob = pprobs[readid][txid]
+            for conv in readconvs:
+                scaledconv = readconvs[conv] * pprob
+                if conv not in txconvs[txid]:
+                    txconvs[txid][conv] = scaledconv
+                else:
+                    txconvs[txid][conv] += scaledconv
+
+    readswithtxs = assignedreads
+    pct = round((readswithtxs / len(convs)) * 100, 2)
+    print('Found transcripts for {0} of {1} reads ({2}%).'.format(readswithtxs, len(convs), pct))
+
+    return txconvs
+
+def collapsetogene(txconvs, gff):
+    #Collapse tx-level count measurements to gene level.
+    #Need to relate transcripts and genes. Do that with the supplied gff annotation.
+ #txconvs = {txid : {a_a : 200, a_t : 1, etc.}} + + tx2gene = {} #{txid : geneid} + geneid2genename = {} #{geneid : genename} + geneconvs = {} # {geneid : {a_a : 200, a_t : 1, etc.}} + + print('Indexing gff..') + gff_fn = gff + db_fn = os.path.abspath(gff_fn) + '.db' + if os.path.isfile(db_fn) == False: + gffutils.create_db(gff_fn, db_fn, merge_strategy='merge', verbose=True) + print('Done indexing!') + + db = gffutils.FeatureDB(db_fn) + genes = db.features_of_type('gene') + + print('Connecting transcripts and genes...') + for gene in genes: + geneid = str(gene.id).split('.')[0].replace('gene:', '') #remove version numbers and gene: + #in the ensembl zebrafish gff, some genes don't have a Name attribute + try: + genename = gene.attributes['Name'][0] + except KeyError: + genename = geneid + geneid2genename[geneid] = genename + for tx in db.children(gene, level = 1): #allow feature types other than 'mRNA'. be flexible here. + txid = str(tx.id).split('.')[0].replace('transcript:', '') #remove version numbers and transcript: + tx2gene[txid] = geneid + print('Done!') + + allgenes = list(set(tx2gene.values())) + + #Initialize geneconvs dictionary + possibleconvs = [ + 'a_a', 'a_t', 'a_c', 'a_g', 'a_n', + 'g_a', 'g_t', 'g_c', 'g_g', 'g_n', + 'c_a', 'c_t', 'c_c', 'c_g', 'c_n', + 't_a', 't_t', 't_c', 't_g', 't_n', + 'a_x', 'g_x', 'c_x', 't_x', 'ng_xg'] + + for gene in allgenes: + geneconvs[gene] = {} + for conv in possibleconvs: + geneconvs[gene][conv] = 0 + + for tx in txconvs: + try: + gene = tx2gene[tx] + except KeyError: + print('WARNING: transcript {0} doesn\'t belong to a gene in the supplied annotation.'.format(tx)) + continue + convs = txconvs[tx] + for conv in convs: + convcount = txconvs[tx][conv] + geneconvs[gene][conv] += convcount + + return tx2gene, geneid2genename, geneconvs + +def readspergene(quantsf, tx2gene): + #Get the number of reads assigned to each tx. This can simply be read from the salmon quant.sf file. 
+    #Then, sum read counts across all transcripts within a gene.
+    #Transcript and gene relationships were derived by collapsetogene().
+
+    txcounts = {} #{txid : readcounts}
+    genecounts = {} #{geneid : readcounts}
+
+    with open(quantsf, 'r') as infh:
+        for line in infh:
+            line = line.strip().split('\t')
+            if line[0] == 'Name':
+                continue
+            txid = line[0].split('.')[0] #remove tx id version in the salmon quant.sf if it exists
+            counts = float(line[4])
+            txcounts[txid] = counts
+
+    allgenes = list(set(tx2gene.values()))
+    for gene in allgenes:
+        genecounts[gene] = 0
+
+    for txid in txcounts:
+        try:
+            geneid = tx2gene[txid]
+        except KeyError: #version numbers were already stripped above, so this transcript simply isn't in the annotation
+            continue
+
+        genecounts[geneid] += txcounts[txid]
+
+    return genecounts
+
+
+def writeOutput(sampleparams, geneconvs, genecounts, geneid2genename, outfile, use_g_t, use_g_c, use_g_x, use_ng_xg):
+    #Write number of conversions and readcounts for genes.
+    possibleconvs = [
+        'a_a', 'a_t', 'a_c', 'a_g', 'a_n',
+        'g_a', 'g_t', 'g_c', 'g_g', 'g_n',
+        'c_a', 'c_t', 'c_c', 'c_g', 'c_n',
+        't_a', 't_t', 't_c', 't_g', 't_n',
+        'a_x', 'g_x', 'c_x', 't_x', 'ng_xg']
+
+    with open(outfile, 'w') as outfh:
+        #Write arguments for this pigpen run
+        for arg in sampleparams:
+            outfh.write('#' + arg + '\t' + str(sampleparams[arg]) + '\n')
+        #total G is number of ref Gs encountered
+        #convG is g_t + g_c + g_x + ng_xg (the ones we are interested in)
+        outfh.write(('\t').join(['GeneID', 'GeneName', 'numreads'] + possibleconvs + [
+            'totalG', 'convG', 'convGrate', 'G_Trate', 'G_Crate', 'G_Xrate', 'NG_XGrate', 'porc']) + '\n')
+        genes = sorted(geneconvs.keys())
+
+        for gene in genes:
+            genename = geneid2genename[gene]
+            numreads = genecounts[gene]
+            convcounts = []
+            c = geneconvs[gene]
+            for conv in possibleconvs:
+                convcount = c[conv]
+                convcounts.append(convcount)
+
+            convcounts = ['{:.2f}'.format(x) for x in convcounts]
+
+            totalG = c['g_g'] + c['g_c'] + c['g_t'] 
+ c['g_a'] + c['g_n'] + c['g_x'] + convG = 0 + possiblegconv = ['g_t', 'g_c', 'g_x', 'ng_xg'] + for ind, x in enumerate([use_g_t, use_g_c, use_g_x, use_ng_xg]): + if x == True: + convG += c[possiblegconv[ind]] + + g_ccount = c['g_c'] + g_tcount = c['g_t'] + g_xcount = c['g_x'] + ng_xgcount = c['ng_xg'] + + totalmut = c['a_t'] + c['a_c'] + c['a_g'] + c['g_t'] + c['g_c'] + c['g_a'] + c['t_a'] + c['t_c'] + c['t_g'] + c['c_t'] + c['c_g'] + c['c_a'] + c['g_x'] + c['ng_xg'] + totalnonmut = c['a_a'] + c['g_g'] + c['c_c'] + c['t_t'] + allnt = totalmut + totalnonmut + + try: + convGrate = convG / totalG + except ZeroDivisionError: + convGrate = 'NA' + + try: + g_crate = g_ccount / totalG + except ZeroDivisionError: + g_crate = 'NA' + + try: + g_trate = g_tcount / totalG + except ZeroDivisionError: + g_trate = 'NA' + + try: + g_xrate = g_xcount / totalG + except ZeroDivisionError: + g_xrate = 'NA' + + try: + ng_xgrate = ng_xgcount / totalG + except ZeroDivisionError: + ng_xgrate = 'NA' + + try: + totalmutrate = totalmut / allnt + except ZeroDivisionError: + totalmutrate = 'NA' + + #normalize convGrate to rate of all mutations + #Proportion Of Relevant Conversions + if totalmutrate == 'NA': + porc = 'NA' + elif totalmutrate > 0: + try: + porc = np.log2(convGrate / totalmutrate) + except: + porc = 'NA' + else: + porc = 'NA' + + #Format numbers for printing + if type(numreads) == float: + numreads = '{:.2f}'.format(numreads) + if type(convG) == float: + convG = '{:.2f}'.format(convG) + if type(totalG) == float: + totalG = '{:.2f}'.format(totalG) + if type(convGrate) == float: + convGrate = '{:.2e}'.format(convGrate) + if type(g_trate) == float: + g_trate = '{:.2e}'.format(g_trate) + if type(g_crate) == float: + g_crate = '{:.2e}'.format(g_crate) + if type(g_xrate) == float: + g_xrate = '{:.2e}'.format(g_xrate) + if type(ng_xgrate) == float: + ng_xgrate = '{:.2e}'.format(ng_xgrate) + if type(porc) == np.float64: + porc = '{:.3f}'.format(porc) + + outfh.write(('\t').join([gene, 
genename, str(numreads)] + convcounts + [str(totalG), str(convG), str(convGrate), str(g_trate), str(g_crate), str(g_xrate), str(ng_xgrate), str(porc)]) + '\n')
+
+if __name__ == '__main__':
+    print('Getting posterior probabilities from salmon alignment file...')
+    pprobs = getpostmasterassignments(sys.argv[1])
+    print('Done!')
+    print('Loading conversions from pickle file...')
+    with open(sys.argv[2], 'rb') as infh:
+        convs = pickle.load(infh)
+    print('Done!')
+    print('Assigning conversions to transcripts...')
+    txconvs = assigntotxs(pprobs, convs)
+    print('Done!')
+
+    tx2gene, geneid2genename, geneconvs = collapsetogene(txconvs, sys.argv[3])
+    genecounts = readspergene(sys.argv[4], tx2gene)
+    #standalone entry point: no sampleparams to record; count G->T and G->C conversions only
+    writeOutput({}, geneconvs, genecounts, geneid2genename, sys.argv[5], True, True, False, False)
\ No newline at end of file

From ee75f48564f6da32b67d21b45dcc5f3335b0cffb Mon Sep 17 00:00:00 2001
From: Matthew Taliaferro
Date: Sat, 15 Feb 2025 09:36:38 -0700
Subject: [PATCH 072/108] add argument for source of GFF

---
 pigpen.py | 8 +++++++-
 1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/pigpen.py b/pigpen.py
index 215c774..9b2b6ee 100644
--- a/pigpen.py
+++ b/pigpen.py
@@ -8,7 +8,6 @@
 from snps import getSNPs, recordSNPs
 from maskpositions import readmaskbed
 from getmismatches import iteratereads_pairedend, getmismatches
-#from assignreads_salmon import getpostmasterassignments, assigntotxs, collapsetogene, readspergene, writeOutput
 from assignreads_salmon_ensembl import getpostmasterassignments, assigntotxs, collapsetogene, readspergene, writeOutput
 from assignreads import getReadOverlaps, processOverlaps
 from conversionsPerGene import getPerGene, writeConvsPerGene
@@ -19,6 +18,7 @@
     parser.add_argument('--samplenames', type = str, help = 'Comma separated list of samples to quantify.', required = True)
     parser.add_argument('--controlsamples', type = str, help = 'Comma separated list of control samples (i.e. those where no *induced* conversions are expected). 
May be a subset of samplenames. Required if SNPs are to be considered and a snpfile is not supplied.') parser.add_argument('--gff', type = str, help = 'Genome annotation in gff format.') + parser.add_argument('--gfftype', type = str, help = 'Source of genome annotation file.', choices = ['GENCODE', 'Ensembl'], required = True) parser.add_argument('--genomeFasta', type = str, help = 'Genome sequence in fasta format. Required if SNPs are to be considered.') parser.add_argument('--nproc', type = int, help = 'Number of processors to use. Default is 1.', default = 1) parser.add_argument('--useSNPs', action = 'store_true', help = 'Consider SNPs?') @@ -40,6 +40,12 @@ parser.add_argument('--outputDir', type = str, help = 'Output directory.', required = True) args = parser.parse_args() + #What type of gff are we working with? + if args.gfftype == 'GENCODE': + from assignreads_salmon import getpostmasterassignments, assigntotxs, collapsetogene, readspergene, writeOutput + elif args.gfftype == 'Ensembl': + from assignreads_salmon_ensembl import getpostmasterassignments, assigntotxs, collapsetogene, readspergene, writeOutput + #If we have single end data, considering overlap of paired reads or only one read doesn't make sense if args.datatype == 'single': args.onlyConsiderOverlap = False From 46b8c6dca1f340ef264327dea748a1145ab435d4 Mon Sep 17 00:00:00 2001 From: Matthew Taliaferro Date: Sat, 15 Feb 2025 09:46:32 -0700 Subject: [PATCH 073/108] small update to pigpen.py --- pigpen.py | 5 ++--- 1 file changed, 2 insertions(+), 3 deletions(-) diff --git a/pigpen.py b/pigpen.py index 9b2b6ee..8d27622 100644 --- a/pigpen.py +++ b/pigpen.py @@ -8,7 +8,6 @@ from snps import getSNPs, recordSNPs from maskpositions import readmaskbed from getmismatches import iteratereads_pairedend, getmismatches -from assignreads_salmon_ensembl import getpostmasterassignments, assigntotxs, collapsetogene, readspergene, writeOutput from assignreads import getReadOverlaps, processOverlaps from 
conversionsPerGene import getPerGene, writeConvsPerGene
@@ -78,8 +77,8 @@
         controlsamplebams.append(starbams[ind])
 
     #We have to be either looking for G->T or G->C, if not both
-    if not args.use_g_t and not args.use_g_c:
-        print('We have to either be looking for G->T or G->C, if not both! Add argument --use_g_t and/or --use_g_c.')
+    if not args.use_g_t and not args.use_g_c and not args.use_g_x:
+        print('We have to be looking for at least one of G->T, G->C, or G->del! Add argument --use_g_t, --use_g_c, and/or --use_g_x.')
         sys.exit()
 
     #We have to be using either read1 or read2 if not both

From 1f1b01444d8968acb1e51a6acd224d6a16e79b7f Mon Sep 17 00:00:00 2001
From: Matthew Taliaferro
Date: Mon, 17 Feb 2025 13:44:36 -0700
Subject: [PATCH 074/108] update readme and add setup.py

---
 README.md | 7 ++++---
 setup.py  | 8 ++++++++
 2 files changed, 12 insertions(+), 3 deletions(-)
 create mode 100644 setup.py

diff --git a/README.md b/README.md
index c72dc69..8fa241c 100644
--- a/README.md
+++ b/README.md
@@ -16,7 +16,7 @@
 ## Overview
 
-OINC-seq (Oxidation-Induced Nucleotide Conversion sequencing) is a sequencing technology that allows the direction of oxidative marks on RNA molecules. 
+[OINC-seq](https://www.biorxiv.org/content/10.1101/2024.11.12.623278v1.abstract) (Oxidation-Induced Nucleotide Conversion sequencing) is a sequencing technology that allows the detection of oxidative marks on RNA molecules. 
Because guanosine has the lowest redox potential of any of the ribonucleosides, it is the one most likely to be affected by oxidation. When this occurs, guanosine is turned into 8-oxoguanosine (8-OG) or further oxidized products. When reverse transcriptase encounters these products, it makes predictable errors in the resulting cDNA (see [here](https://pmc.ncbi.nlm.nih.gov/articles/PMC5623583/)). OINC-seq employs spatially restricted singlet oxygen radicals to oxidize RNAs at specific subcellular locations. The level of RNA oxidation detected for each RNA species is therefore a readout of the amount of that RNA species at that subcellular location. To detect and quantify these conversions, we have created software called **PIGPEN** (Pipeline for Identification of Guanosine Positions Erroneously Notated). @@ -41,6 +41,7 @@ PIGPEN has the following prerequisites: - pandas >= 1.3.5 - bamtools >= 2.5.2 - salmon >= 1.9.0 +- STAR >= 2.7.10 - gffutils >= 0.11.0 - umi_tools >= 1.1.0 (if UMI collapsing is desired) - [postmaster](https://github.com/COMBINE-lab/postmaster) @@ -56,7 +57,7 @@ BACON has the following prerequisites: ## Installation -For now, installation can be done by cloning this repository. As PIPGEN matures, we will work towards getting this package on [bioconda](https://bioconda.github.io/). +Installation can be done by cloning this repository. Alternatively PIGPEN can be installed using [bioconda](https://bioconda.github.io/). In either case, `postmaster` must be installed separately afterward. This can be done using `cargo install --git https://github.com/COMBINE-lab/postmaster` ## Preparing alignment files @@ -138,7 +139,7 @@ After PIGPEN calculates the number of converted and noncoverted nucleotides in e We have observed that the overall rate of conversions (not just G -> T + G -> C, but all conversions) can vary signficantly from sample to sample, presumably due to a technical effect in library preparation. 
For this reason, PIGPEN calculates **PORC** (Proportion of Relevant Conversions) values. This is the log2 ratio of the relevant conversion rate ([G -> T + G -> C] / total number of reference G encountered) to the overall conversion rate (total number of all conversions / total number of positions interrogated). PORC therefore normalizes to the overall rate of conversions, removing this technical effect. -PIGPEN can use G -> T conversions, G -> C conversions, or both when calculating PORC values. This behavior is controlled by supplying the options `--use_g_t` and `--use_g_c`. To consider both types of conversions, supply both flags. +PIGPEN can use G -> T conversions, G -> C conversions, G deletions, or any combination when calculating PORC values. This behavior is controlled by supplying some or all of the options `--use_g_t`, `--use_g_c`, and `--use_g_x`, respectively. ## Using one read of a paired end sample diff --git a/setup.py b/setup.py new file mode 100644 index 0000000..82cdd58 --- /dev/null +++ b/setup.py @@ -0,0 +1,8 @@ +from distutils.core import setup +setup(name = 'pigpen', +description = 'Pipeline for the Identification of Guanosine Positions Erroneously Notated', +author = 'Matthew Taliaferro', +author_email = 'taliaferrojm@gmail.com', +url = 'https://github.com/TaliaferroLab/OINC-seq', +version = '0.0.2', +scripts = ['ExtractUMI.py', 'alignAndQuant.py', 'alignUMIquant.py', 'assignreads.py', 'assignreads_salmon.py', 'assignreads_salmon_ensembl.py', 'bacon_glm.py', 'conversionsPerGene.py', 'filterbam.py', 'getmismatches.py', 'getmismatches_MPRA.py', 'maskpositions.py', 'parsebamreadcount.py', 'pigpen.py', 'snps.py']) From 96a01620efa22fab0c87e82907b7eda981c40f62 Mon Sep 17 00:00:00 2001 From: Matthew Taliaferro Date: Mon, 17 Feb 2025 14:11:02 -0700 Subject: [PATCH 075/108] update setup.py with package excluding --- setup.py | 1 + 1 file changed, 1 insertion(+) diff --git a/setup.py b/setup.py index 82cdd58..48e0e55 100644 --- a/setup.py +++ 
b/setup.py @@ -5,4 +5,5 @@ author_email = 'taliaferrojm@gmail.com', url = 'https://github.com/TaliaferroLab/OINC-seq', version = '0.0.2', +packages = find_packages('.', exclude = ['workflow', 'testdata']), scripts = ['ExtractUMI.py', 'alignAndQuant.py', 'alignUMIquant.py', 'assignreads.py', 'assignreads_salmon.py', 'assignreads_salmon_ensembl.py', 'bacon_glm.py', 'conversionsPerGene.py', 'filterbam.py', 'getmismatches.py', 'getmismatches_MPRA.py', 'maskpositions.py', 'parsebamreadcount.py', 'pigpen.py', 'snps.py']) From ec9d46a544cfc00f554937d3b5a2e3493539a3c9 Mon Sep 17 00:00:00 2001 From: Matthew Taliaferro Date: Mon, 17 Feb 2025 14:21:17 -0700 Subject: [PATCH 076/108] change from distutils to setuptools --- setup.py | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/setup.py b/setup.py index 48e0e55..dacf1f8 100644 --- a/setup.py +++ b/setup.py @@ -1,4 +1,5 @@ -from distutils.core import setup +#from distutils.core import setup +from setuptools import setup, find_packages setup(name = 'pigpen', description = 'Pipeline for the Identification of Guanosine Positions Erroneously Notated', author = 'Matthew Taliaferro', From 16a38a0686c2d5d1c306b8fc29fe7e97f191beea Mon Sep 17 00:00:00 2001 From: Matthew Taliaferro Date: Mon, 17 Feb 2025 14:30:33 -0700 Subject: [PATCH 077/108] add license --- LICENSE | 21 +++++++++++++++++++++ 1 file changed, 21 insertions(+) create mode 100644 LICENSE diff --git a/LICENSE b/LICENSE new file mode 100644 index 0000000..88f1777 --- /dev/null +++ b/LICENSE @@ -0,0 +1,21 @@ +MIT License + +Copyright (c) 2025 Taliaferro lab + +Permission is hereby granted, free of charge, to any person obtaining a copy +of this software and associated documentation files (the "Software"), to deal +in the Software without restriction, including without limitation the rights +to use, copy, modify, merge, publish, distribute, sublicense, and/or sell +copies of the Software, and to permit persons to whom the Software is +furnished to do so, 
subject to the following conditions: + +The above copyright notice and this permission notice shall be included in all +copies or substantial portions of the Software. + +THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR +IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, +FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE +AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER +LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, +OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE +SOFTWARE. From 38c5eb46d50534dd9d4d33e5efc8891c1934629f Mon Sep 17 00:00:00 2001 From: Matthew Taliaferro Date: Mon, 17 Feb 2025 16:01:43 -0700 Subject: [PATCH 078/108] add shebang to pigpen.py --- pigpen.py | 2 ++ 1 file changed, 2 insertions(+) diff --git a/pigpen.py b/pigpen.py index 8d27622..6b9b089 100644 --- a/pigpen.py +++ b/pigpen.py @@ -1,3 +1,5 @@ +#!/usr/bin/env python + #Pipeline for Identification of Guanosine Positions Erroneously Notated #PIGPEN From baf037a29b6c6dbd8834712f76808ea81c923995 Mon Sep 17 00:00:00 2001 From: Matthew Taliaferro Date: Tue, 18 Feb 2025 13:33:40 -0700 Subject: [PATCH 079/108] reorganize into src directory --- README.md | 1 + setup.py | 2 +- ExtractUMI.py => src/ExtractUMI.py | 0 alignAndQuant.py => src/alignAndQuant.py | 0 alignAndQuant2.py => src/alignAndQuant2.py | 0 alignUMIquant.py => src/alignUMIquant.py | 0 assignreads.py => src/assignreads.py | 0 .../assignreads_salmon.py | 0 .../assignreads_salmon_ensembl.py | 0 bacon_glm.py => src/bacon_glm.py | 0 src/bacon_subsample.py | 141 ++++++++++++++++++ .../conversionsPerGene.py | 0 filterbam.py => src/filterbam.py | 0 getmismatches.py => src/getmismatches.py | 0 .../getmismatches_MPRA.py | 0 maskpositions.py => src/maskpositions.py | 0 .../parsebamreadcount.py | 0 pigpen.py => src/pigpen.py | 0 src/simulateOINCreads.py | 134 +++++++++++++++++ snps.py => 
src/snps.py | 0 toy.bam | Bin 2732795 -> 0 bytes 21 files changed, 277 insertions(+), 1 deletion(-) rename ExtractUMI.py => src/ExtractUMI.py (100%) rename alignAndQuant.py => src/alignAndQuant.py (100%) rename alignAndQuant2.py => src/alignAndQuant2.py (100%) rename alignUMIquant.py => src/alignUMIquant.py (100%) rename assignreads.py => src/assignreads.py (100%) rename assignreads_salmon.py => src/assignreads_salmon.py (100%) rename assignreads_salmon_ensembl.py => src/assignreads_salmon_ensembl.py (100%) rename bacon_glm.py => src/bacon_glm.py (100%) create mode 100644 src/bacon_subsample.py rename conversionsPerGene.py => src/conversionsPerGene.py (100%) rename filterbam.py => src/filterbam.py (100%) rename getmismatches.py => src/getmismatches.py (100%) rename getmismatches_MPRA.py => src/getmismatches_MPRA.py (100%) rename maskpositions.py => src/maskpositions.py (100%) rename parsebamreadcount.py => src/parsebamreadcount.py (100%) rename pigpen.py => src/pigpen.py (100%) create mode 100644 src/simulateOINCreads.py rename snps.py => src/snps.py (100%) delete mode 100644 toy.bam diff --git a/README.md b/README.md index 8fa241c..9b6593a 100644 --- a/README.md +++ b/README.md @@ -38,6 +38,7 @@ PIGPEN has the following prerequisites: - bcftools >= 1.15 - pysam >= 0.19 - numpy >= 1.21 +- pybedtools >= 0.9.0 - pandas >= 1.3.5 - bamtools >= 2.5.2 - salmon >= 1.9.0 diff --git a/setup.py b/setup.py index dacf1f8..43ece6b 100644 --- a/setup.py +++ b/setup.py @@ -6,5 +6,5 @@ author_email = 'taliaferrojm@gmail.com', url = 'https://github.com/TaliaferroLab/OINC-seq', version = '0.0.2', -packages = find_packages('.', exclude = ['workflow', 'testdata']), +packages = find_packages('src', exclude = ['workflow', 'testdata']), scripts = ['ExtractUMI.py', 'alignAndQuant.py', 'alignUMIquant.py', 'assignreads.py', 'assignreads_salmon.py', 'assignreads_salmon_ensembl.py', 'bacon_glm.py', 'conversionsPerGene.py', 'filterbam.py', 'getmismatches.py', 'getmismatches_MPRA.py', 
'maskpositions.py', 'parsebamreadcount.py', 'pigpen.py', 'snps.py']) diff --git a/ExtractUMI.py b/src/ExtractUMI.py similarity index 100% rename from ExtractUMI.py rename to src/ExtractUMI.py diff --git a/alignAndQuant.py b/src/alignAndQuant.py similarity index 100% rename from alignAndQuant.py rename to src/alignAndQuant.py diff --git a/alignAndQuant2.py b/src/alignAndQuant2.py similarity index 100% rename from alignAndQuant2.py rename to src/alignAndQuant2.py diff --git a/alignUMIquant.py b/src/alignUMIquant.py similarity index 100% rename from alignUMIquant.py rename to src/alignUMIquant.py diff --git a/assignreads.py b/src/assignreads.py similarity index 100% rename from assignreads.py rename to src/assignreads.py diff --git a/assignreads_salmon.py b/src/assignreads_salmon.py similarity index 100% rename from assignreads_salmon.py rename to src/assignreads_salmon.py diff --git a/assignreads_salmon_ensembl.py b/src/assignreads_salmon_ensembl.py similarity index 100% rename from assignreads_salmon_ensembl.py rename to src/assignreads_salmon_ensembl.py diff --git a/bacon_glm.py b/src/bacon_glm.py similarity index 100% rename from bacon_glm.py rename to src/bacon_glm.py diff --git a/src/bacon_subsample.py b/src/bacon_subsample.py new file mode 100644 index 0000000..6f5a7bb --- /dev/null +++ b/src/bacon_subsample.py @@ -0,0 +1,141 @@ +#As a statistical framework for identifying genes with differing 8OG-mediated conversion rates +#across conditions, use a subsampling approach. For each gene, subsample the reads assigned to it, +#calculating a porc value for each subsample. So then you end up with a distribution of porc +#values for each gene in each sample. + +#This relies on two pickled dictionaries produced by pigpen.py: read2gene.pkl and readconvs.pkl. +#The first is a dictionary of the form {readid : ensembl_gene_id}. 
+#The second is a dictionary of the form {readid : {convs}}
+#where {convs} is of the form {x_y : count} where x is the reference sequence and y is the query sequence.
+#x is one of 'agct' and y is one of 'agctn'.
+
+#Then, using Hotelling's T-squared test, compare distributions of porc values across conditions.
+
+import pickle
+import numpy as np
+import random
+import sys
+from collections import defaultdict
+
+def getReadsperGene(read2genepkl):
+    #pigpen produces a dictionary of the form {readid : ensembl_gene_id} (assignreads.processOverlaps)
+    #It has written this dictionary to a pickled file.
+    #We need a dictionary of the form {ensembl_gene_id : [readids]}
+
+    print('Loading gene/read assignments...')
+    with open(read2genepkl, 'rb') as infh:
+        read2gene = pickle.load(infh)
+    print('Done!')
+
+    readspergene = {} #{ensembl_gene_id : [readids that belong to this gene]}
+
+    for read in read2gene:
+        gene = read2gene[read]
+        if gene not in readspergene:
+            readspergene[gene] = [read]
+        else:
+            readspergene[gene].append(read)
+
+    return readspergene
+
+def makeGeneConvdict(readspergene, readconvspkl):
+    #Want to make a dictionary that looks like this
+    #{gene : [{convs1}, {convs2}, ...]} where each conv dict corresponds to one read
+    #that has been assigned to this gene
+
+    print('Loading read conversion info...')
+    with open(readconvspkl, 'rb') as infh:
+        readconvs = pickle.load(infh)
+    print('Done!')
+
+    print('Making gene : read conversion dictionary...')
+    geneconv = defaultdict(list)
+    for gene in readspergene:
+        reads = readspergene[gene]
+        for read in reads:
+            try:
+                geneconv[gene].append(readconvs[read])
+            except KeyError:
+                pass #a read that was assigned to a gene but for which no conversions were calculated
+    print('Done!')
+
+    return geneconv
+
+def calcPORC(convs):
+    #accepting a list of conversion dictionaries
+
+    convGcount = 0
+    totalGcount = 0
+    allconvcount = 0
+    allnonconvcount = 0
+
+    for conv in convs:
+        convG = conv['g_t'] + conv['g_c']
+ 
totalG = conv['g_t'] + conv['g_c'] + \ + conv['g_a'] + conv['g_n'] + conv['g_g'] + + allconv = conv['a_t'] + conv['a_c'] + conv['a_g'] + conv['g_t'] + conv['g_c'] + \ + conv['g_a'] + conv['t_a'] + conv['t_c'] + \ + conv['t_g'] + conv['c_t'] + conv['c_g'] + conv['c_a'] + \ + conv['a_n'] + conv['g_n'] + conv['c_n'] + conv['t_n'] + + allnonconv = conv['a_a'] + conv['g_g'] + conv['c_c'] + conv['t_t'] + + convGcount += convG + totalGcount += totalG + allconvcount += allconv + allnonconvcount += allnonconv + + allnt = allconvcount + allnonconvcount + try: + convGrate = convGcount / totalGcount + except ZeroDivisionError: + convGrate = np.nan + + try: + totalmutrate = allconvcount / allnt + except ZeroDivisionError: + totalmutrate = np.nan + + #Calculate porc + if totalmutrate == np.nan: + porc = np.nan + elif totalmutrate > 0: + try: + porc = np.log2(convGrate / totalmutrate) + except: + porc = np.nan + else: + porc = np.nan + + return porc + +def subsamplegeneconv(geneconv, subsamplesize, n_subsamples): + subsampledporcs = defaultdict(list) # {ensembl_gene_id : [porc values]} + + for gene in geneconv: + print(gene) + convs = geneconv[gene] + nreads = len(convs) + n_readstosubsample = int(nreads * subsamplesize) + for i in range(n_subsamples): + subsampledconvs = random.sample(convs, n_readstosubsample) + porc = calcPORC(subsampledconvs) + print(porc) + subsampledporcs[gene].append(porc) + + return subsampledporcs + +#Take in a sampconds, calculate subsamples for each, end up with a dictionary like so: +#{gene : {condition : [[subsampled porcs sample 1], [subsampled porcs sample 2], ...]}} + + + + +if __name__ == '__main__': + + readspergene = getReadsperGene(sys.argv[1]) + geneconv = makeGeneConvdict(readspergene, sys.argv[2]) + subsamplegeneconv(geneconv, 0.3, 100) + diff --git a/conversionsPerGene.py b/src/conversionsPerGene.py similarity index 100% rename from conversionsPerGene.py rename to src/conversionsPerGene.py diff --git a/filterbam.py b/src/filterbam.py 
similarity index 100% rename from filterbam.py rename to src/filterbam.py diff --git a/getmismatches.py b/src/getmismatches.py similarity index 100% rename from getmismatches.py rename to src/getmismatches.py diff --git a/getmismatches_MPRA.py b/src/getmismatches_MPRA.py similarity index 100% rename from getmismatches_MPRA.py rename to src/getmismatches_MPRA.py diff --git a/maskpositions.py b/src/maskpositions.py similarity index 100% rename from maskpositions.py rename to src/maskpositions.py diff --git a/parsebamreadcount.py b/src/parsebamreadcount.py similarity index 100% rename from parsebamreadcount.py rename to src/parsebamreadcount.py diff --git a/pigpen.py b/src/pigpen.py similarity index 100% rename from pigpen.py rename to src/pigpen.py diff --git a/src/simulateOINCreads.py b/src/simulateOINCreads.py new file mode 100644 index 0000000..52b013d --- /dev/null +++ b/src/simulateOINCreads.py @@ -0,0 +1,134 @@ +#python >=3.6 +import random +import sys +import gzip + +#usage: python simulateOINCreads.py + +#This is the wildtype sequence of the amplicon. Paired end reads will read in from both ends. +seq = 'ACAGTCCATGCCATCACTGCCACCCAGAAGACTGTGGATGGCCCCTCCGGGAAACTGTGGCGTGATGGCCGCGGGGCTCTCCAGAACATCATCCCTGCCTCTACTGGCGCTGCCAAGGCTGTGGGCAAGGTCATCCCTGAGCTGAACGGGAAGCTCACTGGCATGGCCTTCCGTGTCCCCACTGCCAACGTGTCAGTGGTGGACCTGACCTGCCGTCTAGAAAAACCTGCCAAATATGATGACATCAAGAAGGTGGTGAAGCAGGCGTCGGAGGGCCCCCTCAAGGGCATCCTGGGCTACACTGAGCACCAGGTGGTCTCCTCTGACTTCAACAGCGACACCCACTCCTCCACCTTTGACGCTGGGGCTGGCATTGCCCTCAACGACCACTTTGTCAAGCTC' +#first 200 nt of above seq so that we can incorporate read overlap +seq = 'ACAGTCCATGCCATCACTGCCACCCAGAAGACTGTGGATGGCCCCTCCGGGAAACTGTGGCGTGATGGCCGCGGGGCTCTCCAGAACATCATCCCTGCCTCTACTGGCGCTGCCAAGGCTGTGGGCAAGGTCATCCCTGAGCTGAACGGGAAGCTCACTGGCATGGCCTTCCGTGTCCCCACTGCCAACGTGTCAGTGGT' + + +#Intrinsic (i.e. 
cell- or RT-derived) mutation rates
+mutfreqs = {'A' : {'C' : 1e-5, 'T' : 0, 'G' : 2e-4},
+'C' : {'G' : 5e-5, 'T' : 1.5e-3, 'A' : 4e-4},
+'G' : {'C' : 3e-5, 'T' : 2e-6, 'A' : 8e-4},
+'T' : {'A' : 3e-6, 'C' : 7e-5, 'G' : 8e-6}}
+
+#Sequencing error rate
+seqerrorrate = 0.001
+
+
+def revcomp(nt):
+    revcompdict = {
+        'G': 'C',
+        'C': 'G',
+        'A': 'T',
+        'T': 'A',
+        'N': 'N',
+        'g': 'c',
+        'c': 'g',
+        'a': 't',
+        't': 'a',
+        'n': 'n'
+    }
+
+    nt_rc = revcompdict[nt]
+
+    return nt_rc
+
+def makecDNAseq(wtseq, mutfreqs):
+    #Make the sequence of the cDNA for this read pair.
+    #This is intended to be the entire fragment. We will break it up
+    #into the read pairs (and incorporate sequencing errors) later.
+    #build sequence nt by nt
+    outseq = ''
+    for nt in wtseq:
+        possiblents = list(mutfreqs[nt].keys())
+        possiblentfreqs = list(mutfreqs[nt].values())
+        wtfreq = 1 - sum(possiblentfreqs)
+        #add chance nt will be wt
+        possiblents.append(nt)
+        possiblentfreqs.append(wtfreq)
+
+        outnt = random.choices(
+            population = possiblents,
+            weights = possiblentfreqs,
+            k = 1
+        )
+        outnt = outnt[0]
+        outseq += outnt
+
+    return outseq
+
+def addseqerrors(readseq, seqerrorrate):
+    #Given a read sequence, add simulated sequencing errors
+    outseq = ''
+    for nt in readseq:
+        #is there a sequencing error at this position? 
+ possiblents = list(mutfreqs[nt].keys()) + possiblentfreqs = [seqerrorrate / 3] * 3 + #add chance of not having a sequencing error + possiblents.append(nt) + possiblentfreqs.append(1 - seqerrorrate) + + outnt = random.choices( + population = possiblents, + weights = possiblentfreqs, + k = 1 + ) + outnt = outnt[0] + outseq += outnt + + return outseq + +def makefastq(seq, mutfreqs, seqerrorrate, readlength, depth, outfile): + #Given a wildtype amplicon (seq), paired end read length, desired depth, make simulated reads + #with desired mutation and sequencing error rates + #all quality scores are J (41) + readlength = int(readlength) + depth = int(depth) + + with gzip.open(outfile + '_1.fq.gz', 'wt') as read1outfh, gzip.open(outfile + '_2.fq.gz', 'wt') as read2outfh: + readcounter = 0 + for i in range(depth): + readcounter +=1 + if readcounter % 100000 == 0: + print('Creating read {0}...'.format(readcounter)) + + #Make fragment for this readpair + fragseq = makecDNAseq(seq, mutfreqs) + #Make reads from this fragment + read1seq = fragseq[:readlength] + read2seq = fragseq[readlength * -1:] + #reverse complement read2 + read2seq_rc = '' + for nt in read2seq: + nt_rc = revcomp(nt) + read2seq_rc += nt_rc + read2seq_rc = read2seq_rc[::-1] + + #Add sequencing errors + read1seq = addseqerrors(read1seq, seqerrorrate) + read2seq = addseqerrors(read2seq_rc, seqerrorrate) + + qualityscores = 'J' * len(read1seq) + readtitle = '@simread_' + str(readcounter) + + read1outfh.write(readtitle + '\n' + read1seq + '\n' + '+' + '\n' + qualityscores + '\n') + read2outfh.write(readtitle + '\n' + read2seq + '\n' + '+' + '\n' + qualityscores + '\n') + + with open('simulationparams.txt', 'w') as outfh: + outfh.write(('\t').join(['ref', 'mut', 'freq']) + '\n') + for ref in mutfreqs: + for mut in mutfreqs[ref]: + freq = str(mutfreqs[ref][mut]) + outfh.write(('\t').join([ref, mut, freq]) + '\n') + + +makefastq(seq, mutfreqs, seqerrorrate, sys.argv[1], sys.argv[2], 'oincsimulation') + + + diff 
--git a/snps.py b/src/snps.py
similarity index 100%
rename from snps.py
rename to src/snps.py
diff --git a/toy.bam b/toy.bam
deleted file mode 100644
index 18b41f7f5e942cc56ac5c81bdd29c83406349589..0000000000000000000000000000000000000000
Binary files a/toy.bam and /dev/null differ
z&^%30(`=1f(;B+KT40Jq)Y{L*O+;~}Tyzt|vQOT@SgR{wTh*O`NbnbT77cW26B_GA zwSLj?qC)Ts4PE^H$$swrDqS4T0meDJE`gWB$%L2wUqp94Ny-uq-FP}+cQkBoIVLIt z-u8eelGzT{BG{lk#MuE>^J^NG{A*lmz%kS2#rc@n#`k)L&)9#>!2CQ>~^4u zdI%d%)^LMpn8fxA&0V2ijsQMXtJL3-s8n>=mMXa>an3H2W(Z&j0OkN76WVZRvVtzG zj4s!?ukR6#)MoKW*rmBFJ-7Ku(qY%zfz^0qDKZLgLvoYWRXvmBV*DbQT>y&LkKY;NaxK%j z!=O?8`;zd=D*?aX9#XJt5s-=&Q{1aOT2GhbBPbvgrKJ4D?CgJT)?0 zO#w%Esmis^L@V(V%d#@xN_5I#&@4HZ2D>>FgUOVph`BpEjGW#!Abb-QZ{OGx5FQ`F zfJ$b2v~C`9NR*&wLIaq?%+vR+V)hfX;nU91?wP!uY9bphmo^iqudd+ z&&3B!I&5{%d@IYWG_y+PGMeYC$QgBEE=vps{z;5YIIUE7+Tz@+lOqGK#eaL@-$80W z`Wb=B<;fjdoxZ#3;J_Ot2qIlBls(z{Tzo!YRvs5vUsD=J2}g)Oh8j3#^c#EbnmX;y z@Wku2AlTvJF%cBa3k1p3O-v4f+FwtsNU48erJKMC*BX7?ds)HFL+vdZ+bg>0dURKG z(M9>+ssL5r9XSQMJf(#8pv#G`7bNjh8%$qx6TmS*_W-n~ZR6v@j3cJREBXq!4JIm3 zETKVf!twY&o=*XO2yRXZ|CzkbMgw69x8U9HvDikl1JO$X+!LD@)!4jP*5-R+3gR=- zmC@{Wd`h?@wK4d&9_#yaA0Xl6Bg&0sR$vUGH?rAOU0~I!Ax@2i62f9y8kS$M7_iIW zNUY~RHC&b!pma!_7zH4VT`+hxkcZ9Xs{NmM)tSlb$g#(UhV|*#*A84b+uXIR18-0C zT*Id~u|lX_1*p}rLXkJ1?G^+OYN12ys<$zEcUXN~DJHYg+Vfk0-%D!(e7gj{pMOl* z6fndcDORi5bURyyi>eFpdE6>UE^|Tk7zWJ%>q6LqW+OYo=0=C$Y0-D20<4Z_)}=?h zBe+-Xp{bxYy{lr8g1-uluVo;nsm%o`I$Djh zL!8=%d7SiZm`}l6pxokjh1z$qUCyGVNESK|x2%C%Q7}I>;!If4fzt%W>%QwnS=!4X z!j9w(rDcHZZ_qOj>)HBFVkZ`-6;X8%z&}YzW4ffJyE&%a8wsIhS?DCUvb^wJFb)}WT>=6#VPJZfEQd2PnwKPLL)LxtV`q;Y zM+jRr3H~(+3BC>HK0KsD__l7~f=nYYWoA)bW_b*E?x-v?s1EAJ$_+oJSw`a6_i3WT zQJ-C3g^WGK?#KMJK>*(-5rPQnv&wz(dH|kBY)m(BPinm=3<^iV53~&3+jXfM8h2Sk ztL?ydNvxPZggNuj9>0qa^Im4FYHP;JDKC$yYtY0{HI%(;nDwUy5&SU;f?tKX2UXj2 zqZC{$F=2eVsv9X@NO(G6huc%rb*}V_aHP`i=vhl|FV4P+)iT%)qn)?*kQh4>lwq##f(FZn4ser5bRnFFrJx zyd5sRN+|eog-$CZ35$y&0VJ>7CX%s0@~TGC?I7s_lA3AWx}~)SwqkO4ESkUY)E7SZ z8KbHrt|C6GrsCI*{#%I|TJ@dE+RD|K&ffU1lU= zs&zo&=SV+6VWxvIb6|te)f~H)tXGlH#Ai!AqkzLKHo)+4EuGREs4+Ug*L)0$!C#b! 
z!OsxIPTh46#ra|lgc#NR*G51tHwevvj7LY%Iqq)ivraf3MuByXVoB#ZFteU2ReHv;~O^ z0^-uuiBDqP(y#n0KXG&4XK`-SG53aBRqdN!?cUbJl__(quI-{{G||g4dbh3R zgF_yasd@M_5|I4FfbXeyP&*{QHGtrsQT4`` zNWpXHujtcXX)#y+m;<>6XAjRidXM=og6*CM|4pJtJ_(l04=LkXOK?b55xiY3R>Gyb zi0bk(p)_%spXIJw;!0ArIPB~}YXqz+K=+Hn!OoBHsx(zcOxc~pq?XvLgI?STZz>zF z^@AK*MGdxL|!V-UbUEdlTYx?&6K zx@>7Np932&X9B>Hz{BvD0B{rtFV^GyDE~SKmB`mVz~2%dm?+REFYCr|vx4 zx{iVW?}&}nG@gnms;9aA$@3KPwkB^_emw}T!Hxz5=am*26~X(0iAO+ib$tjr?M1q# z-%%y1+O?8xESd+c^Y;mu4LYpr8TaS*()PRWg*`fK$h<@EM`cjA%#>B>Jw0-~p3VKl zaLd$)W0r#UE{Kuj#7f-~{OF{-vivjX<9aDpz0Cdr34o|u1~(6u>yb05E%l4I$aUSy z;1UWhp=J4Xjf1F6sM|x-zVQL9q=yHt>3bammn#AM83}-BUcaDh$HD zoZcijLMbbX!nG1GAPcyG$5(WCsRG7A;lnNM7WRvjJf$a6(0LI9w;$_Myc%>~`4C{u zg5zObgS|_U-iH+Uz7}S%`-nkX{txT0dBS-MKw(PPLhzG%-yg;}&HI&+>3&S+EMl_& zYS+jp(61HyaS8YqSFHZtgPk(Q?Z$W&yThN8mvo3sIIL#pczAZT6De2Aw#UoU1)`{5 zv4o^H&udM*)6-5Z9c%ktWa(FI;Po=j+cB9>Nz}-vVeaQt0HHu$zqTOy!|Q1{-2#5A zYUkTy6-k4=Uob5!_(u++sbSZtfa_H#)8X!7y3^ zf{R%+I)b*~9v;-hIxOByDA*O_U;&uv@p3TS6ceN+u8CI_aW#uNoI`D!yR#y;#dkpz zTOsihYmXOpgJ|UaF5rRy8(M(X&i4h~+Rw&-*5Sny@NcCt4= z)N#SH5$tgBA0c=;f^9Ay7eTw?ph_bO^RO<&_V$iN^D!Hph%3Dsv`wE~+1-9#E^fW9 z$s~>Lf4xHMzcl3FvyHiDRdNvbuU^cf^>(?2|0m!3;>qN6)JE^^B)wT9N9Z&1=f2Z= z?hh*+u~D9T#77>$p~W&o&wb+eJ)1|H?Q%0)hIixpKGAL3`{kp5-lOtQn^ z;<;-H`9+DH^9`8$rJ?Ve7r5hIno)$iGcT%kQO7zbde1q@(gLI^xHk;sV`7vk582>v zjV_nQoiVgocn26>rHx)295QW>TRNsK8f}4ZW(MMjhE9u(c8=Hz+;Mn&#tKv$>=u)^ zOfgMi_XJoaukmOk?hR>TMI@L`Ht)!V#!;&@;;OX|-`6BM23Pb3dsq&v-kqmTEKo=p z{9!c1hVeclx_-ZC>3V1DZ|L>T)_VkzE}v56fplo_3a)t_+S>v`_aq35K7^UBn;M$a zxUZ&%F#{wD;RIvbPmDJMk(s<%CpWS2bWTmUUnvlrR<$rgn;oyfb__^j{5%cYE7-i5 z8`3BeqaBDQbGVx+@8^K62jB;)T)#br(PWp+m6&x+v>-7VoTCtYw>IZI|~3~B6zP2wY_}-f~&iPpr`z(eDtFy zrDM01h7!RSy{xZ~UjcK^EA%lyh&geK`F6hC)a@2Vg0FBzk&Y|@=BQ^^m@EAQRfMq4 zyHO<(4T?mOKDb%SD%qcyR{Kc-loc51+Qy1T`z!bXHm6WcPU)h z$S0+792*HXLh)>8*Qw47G?52_&99t*n|O5-Bhjs$H@3uXzkH>L*D+zalZQ4UT7xEr zPEFoD3KyJq-BrX-L3_{UHjm>~eT}fgm(AeJvh>OU+YK@R{-6{&;VWQoE# zZpE@#tSXb`V&qs{Sd)9w++a=FxqVAtIe|Nbm=Cz 
zt65nvi(X~aPpPi6tBjSn_^DszS(cPp>g%PS0$hecQVJhaixxUbbmFrtF0p*kPm_~6 zL}`v9UfR4(gBpA1%Ppk#9R?BnpajA901duZx$W5uURJZ&c0ONBMR$eq0CW++_7fe( z^KiMC>PX)(?_V%j4)#Hb7*ry-7CoY7>(z`C;;F(XKl zd8S)5u?W$5o<{3ooM4xL(MSjB<*_M;9W*ZU2DskqysK8al)mp*BsBO|puyV*+ABo! z^>njc&Ngdd%pBlMi=|df087irX<<;@15(%h4E|4;9}}HkQEDYdYNhvE0wx+RT^d9K z=btN*w`9hQoSX=CSxJcDCyWqR2V5D2vw-6h+)xv!5Svuzahkh^!KhD-gyrHlmb&^u zUu9I3SSK0Oih0N~Iq9oubk0=N;z$rg*{F(O$>iWz@Rg-|Li|KFoYP=B;#_qCVQBPK z@dsh<8$;$`wdOushqG-o4HtF4#&OUi0d1z%wyAauSn=04=2y46bUl%a3E_wjl*;4a5V{O7xU|c=Fxb#hSM9S%VJ!HC&_xa?9`HiBCJraX+A9>!B?9w=RT;Y z%#Ls+WQ{u;ZbG3)#-qS{Far~pM0zCn^i%PxYes&nL=>Wx_`eQRXLl{a({+Y~0u#UR zS!0Jyukm_=L#>U!U!slQ1#`bRtd03LB;j(kjJET662>33r=wD6pSi4oOjl#NpQ z2y*5vE@EDb5tsO}0iydB=4sr*suHoAD?S3I00m~{*9JlDwN@1%__+>3JDt6B+Mo9R zG_N`gs+K-M@07;nvVw@3iQgsB#>m)jKCnf7hlxiZJ*qPj1L*w+X}2X(4QYJ-sDbBJ z%jo;owsHX5O#B`R31Z0by~_R8`vF!te#zG;C zz&a*?RV}3xhz7`>G!)fX6_Q1i_XSa1bdLFYoPAvx#?~Aefd3Dpv<|`m{{R3ViwFb& z00000{{{d;LjnLeKjnQ7uq;bi-`@Kk?_G4!*ZrEctH$7T*AA~w6gm5U&nn65VYCKB zgFFf`rj*h|F)0wW6eLBYo7Oa1F~ZfTsWSdS5=AR{MkN}tf)y+NK@65as8pg@QA8|> zF%m2R$@lfl?9QK^Ip>~x?t9l{?mcIBduI2X`JM0o_x1NZ;| zI2vV4`JAzw3Z19al1$~?TS1uiln|2>t|^mFqA~I1Bh$*{3w9i-E20W z?N+n(d^cO4Ew%wboq7NwGm&Js#V9LMl0U*Nv32%d|5W(`1v0AA|?c(VcE-|wTqivUP&2SB)u82{|!D2d=< zG9D+e0+0AFi6-GT<8Zt2NsO09_%w;e@g9ds0#_z+1fMdQ;OQ}ZR8+hvDc%8}GC3KK z{_Gmu^&S$5K-Qsk0I^V4~+r>N} zcy$zLFGzv6bwuZ0(f$=WJbXI{&H=&MY`Id55`fM}X zZBEzQ)4+|rcdrGPj4%s02}zvJC^5uXU|Sl{m`IJU4lio1&sg{wJ$;Yu1f@8`{b@?@pG{Dn)sgM2%>8?JKZhL)~ns}Y`HEK!qtGMGgSe2 z6vU)q!XCTe_?gaX(tMiijqPZ^h&FL?63=!>@P&`)r=`~pB!^^py~fB$nqM}kR-0rv zh4(xRNKP|Cq|OA)q_^Y&D7jz)O-{ zCIyou14){}%hahnE}=T-NN!VK48M7Q7Jri(bTSZO7rL7b=w3RYZ|@{@;IQ5RhZ0CD zNdjR2>4j&d){RPIBd*^je#L=Cl5n?iG>PLD3}G2RDB6S4MX>c6i7xIySE-PnI-){8 zX}ui%uRZ~YP6a68*