diff --git a/.gitignore b/.gitignore new file mode 100644 index 0000000..92607ee --- /dev/null +++ b/.gitignore @@ -0,0 +1,14 @@ + +.DS_Store +.spyproject/config/codestyle.ini +.spyproject/config/defaults/defaults-codestyle-0.2.0.ini +.spyproject/config/defaults/defaults-encoding-0.2.0.ini +.spyproject/config/defaults/defaults-vcs-0.2.0.ini +.spyproject/config/defaults/defaults-workspace-0.2.0.ini +.spyproject/config/workspace.ini +.spyproject/config/vcs.ini +.spyproject/config/encoding.ini +.spyproject/config/backups/codestyle.ini.bak +.spyproject/config/backups/encoding.ini.bak +.spyproject/config/backups/vcs.ini.bak +.spyproject/config/backups/workspace.ini.bak diff --git a/LICENSE b/LICENSE new file mode 100644 index 0000000..88f1777 --- /dev/null +++ b/LICENSE @@ -0,0 +1,21 @@ +MIT License + +Copyright (c) 2025 Taliaferro lab + +Permission is hereby granted, free of charge, to any person obtaining a copy +of this software and associated documentation files (the "Software"), to deal +in the Software without restriction, including without limitation the rights +to use, copy, modify, merge, publish, distribute, sublicense, and/or sell +copies of the Software, and to permit persons to whom the Software is +furnished to do so, subject to the following conditions: + +The above copyright notice and this permission notice shall be included in all +copies or substantial portions of the Software. + +THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR +IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, +FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE +AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER +LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, +OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE +SOFTWARE. 
diff --git a/README.md b/README.md index c824098..1eebb1f 100644 --- a/README.md +++ b/README.md @@ -1,2 +1,210 @@ -# OINC-seq -Detecting oxidative marks on RNA through high-throughput sequencing +# OINC-seq

Detecting oxidative marks on RNA using high-throughput sequencing + + ,-,-----, + PIGPEN **** \ \ ),)`-' + <`--'> \ \` + /. . `-----, + OINC! > ('') , @~ + `-._, ___ / + -|-|-|-|-|-|-|-| (( / (( / -|-|-| + |-|-|-|-|-|-|-|- ''' ''' -|-|-|- + -|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-| + + Pipeline for Identification + Of Guanosine Positions + Erroneously Notated + +## Overview + +[OINC-seq](https://www.biorxiv.org/content/10.1101/2024.11.12.623278v1.abstract) (Oxidation-Induced Nucleotide Conversion sequencing) is a sequencing technology that allows the detection of oxidative marks on RNA molecules. Because guanosine has the lowest redox potential of any of the ribonucleosides, it is the one most likely to be affected by oxidation. When this occurs, guanosine is turned into 8-oxoguanosine (8-OG) or further oxidized products. When reverse transcriptase encounters these products, it makes predictable errors in the resulting cDNA (see [here](https://pmc.ncbi.nlm.nih.gov/articles/PMC5623583/)). OINC-seq employs spatially restricted singlet oxygen radicals to oxidize RNAs at specific subcellular locations. The level of RNA oxidation detected for each RNA species is therefore a readout of the amount of that RNA species at that subcellular location. + +To detect and quantify these conversions, we have created software called **PIGPEN** (Pipeline for Identification of Guanosine Positions Erroneously Notated). + +PIGPEN starts with RNAseq fastq files. These files are aligned to the genome using [STAR](https://github.com/alexdobin/STAR). Single-end and paired-end reads are supported, although paired-end reads are preferred (for reasons that will become clear later). To minimize the contribution of positions that appear as mutations due to non-ideal alignments, PIGPEN only considers uniquely aligned reads (mapping quality == 255). For now, it is required that paired-end reads be stranded, and that read 1 correspond to the sense strand.
This is true for most, but not all, modern RNAseq library preparation protocols. + +Uniquely aligned reads are then extracted and used to quantify transcript abundances using [salmon](https://combine-lab.github.io/salmon/). Posterior probabilities of transcript assignments are then derived using [postmaster](https://github.com/COMBINE-lab/postmaster). `STAR`, `salmon`, and `postmaster` must be in the user's `$PATH`. All three of these preparatory steps can be easily and automatically done using `alignAndQuant.py`. + +Following the creation of alignment files produced by `STAR` and `postmaster` as well as transcript quantifications produced by `salmon`, these files are then used by `pigpen.py` to identify nucleotide conversions, assign them to transcripts and genes, and then quantify the number of conversions in each gene. A graphical overview of the flow of `PIGPEN` is shown below. + +![alt text](https://images.squarespace-cdn.com/content/v1/591d9c8cbebafbf01b1e28f9/77d2062a-a31e-41b9-90ad-5963f618c6a6/updatedPIGPENscheme.png?format=1000w "PIGPEN overview") + +## Requirements + +PIGPEN has the following prerequisites: + +- python >= 3.8 +- samtools >= 1.15 +- varscan >= 2.4.4 +- bcftools >= 1.15 +- pysam >= 0.19 +- numpy >= 1.21 +- pybedtools >= 0.9.0 +- pandas >= 1.3.5 +- bamtools >= 2.5.2 +- salmon >= 1.9.0 +- STAR >= 2.7.10 +- gffutils >= 0.11.0 +- umi_tools >= 1.1.0 (if UMI collapsing is desired) +- [postmaster](https://github.com/COMBINE-lab/postmaster) +>Note: postmaster is a [rust](https://www.rust-lang.org/) package. Installing it requires rust (which itself is installable using [conda](https://anaconda.org/conda-forge/rust)). Once rust is installed, use `cargo install --git https://github.com/COMBINE-lab/postmaster` to install postmaster. 
+ +BACON has the following prerequisites: + +- python >= 3.6 +- statsmodels >= 0.13.2 +- numpy >= 1.21 +- rpy2 >= 3.4.5 +- R >= 4.1 + +## Installation + +### Option 1: conda + +PIGPEN can be installed from [bioconda](https://bioconda.github.io/) using `conda install -c bioconda pigpen`. Following installation using `conda`, PIGPEN is accessible by calling `pigpen`, e.g. `pigpen -h`. Currently, `postmaster` must be installed separately afterward. This can be done using `cargo install --git https://github.com/COMBINE-lab/postmaster`. If PIGPEN was installed via `conda`, make sure to install postmaster in the same environment. #TODO: make bacon and alignAndQuant easily accessible after installation. + +### Option 2: manual installation + +Alternatively, you can download PIGPEN directly from this repository. PIGPEN is python-based, but requires a number of extra modules as well as some R and Rust libraries. These are most easily installed with [conda](https://docs.conda.io/projects/conda/en/stable/index.html). The necessary software is listed in `pigpen_env.yaml`. This configuration file can be provided to conda and has all the information needed to set up a PIGPEN-ready environment. + +`conda env create -f pigpen_env.yaml` + +This will create an environment called `pigpen_env` that contains all the necessary modules. To activate the environment, type + +`source activate pigpen_env` + +Uncompress the repository and move into the resulting directory. Install PIGPEN using + +`python setup.py install` + +Then to make sure you are ready to go, ask for the help options in the PIGPEN, BACON, and alignAndQuant scripts using + +`pigpen -h` + +`bacon -h` + +`alignAndQuant -h` + +If there are errors, one or more of the modules likely did not install properly. In that case, using an alternative package manager like pip may help. If you see no errors, you are good to go.
+ +## Preparing alignment files + +`pigpen` expects a particular directory structure for organization of `STAR`, `salmon`, and `postmaster` outputs. This is represented below. + +``` +workingdir +│ +└───sample1 +│ │ +│ └───STAR +│ │ │ sample1Aligned.sortedByCoord.out.bam +│ │ │ sample1Aligned.sortedByCoord.out.bam.bai +│ │ │ ... +│ │ +│ └───salmon +│ │ │ sample1.quant.sf +│ │ │ sample1.salmon.bam +│ │ │ ... +│ │ +│ └───postmaster +│ │ │ sample1.postmaster.bam +│ │ │ sample1.postmaster.bam.bai +│ │ │ ... +│ +└───sample2 +│ │ +│ └───STAR +│ │ │ sample2Aligned.sortedByCoord.out.bam +│ │ │ sample2Aligned.sortedByCoord.out.bam.bai +│ │ │ ... +│ │ +│ └───salmon +│ │ │ sample2.quant.sf +│ │ │ sample2.salmon.bam +│ │ │ ... +│ │ +│ └───postmaster +│ │ │ sample2.postmaster.bam +│ │ │ sample2.postmaster.bam.bai +│ │ │ ... +... +``` + +This structure can be automatically achieved by running `alignAndQuant` in `workingdir` once for each sample. Following this, the samples are ready to be analyzed with `pigpen`. + +For example: + +`alignAndQuant --forwardreads reads.r1.fq.gz --reversereads reads.r2.fq.gz --nthreads 32 --STARindex /path/to/STARindex --salmonindex /path/to/salmonindex --samplename sample1` + +`STARindex` and `salmonindex` should be created according to the instructions found [here](https://github.com/alexdobin/STAR/blob/master/doc/STARmanual.pdf) and [here](https://salmon.readthedocs.io/en/latest/). + +## Running PIGPEN + +Samples are then ready for analysis with `pigpen.py`. From `workingdir`, a comma-separated list of samples is supplied to `--samplenames`. In the example above, this would be `--samplenames sample1,sample2`. Optionally, a list of control samples is provided to `--controlsamples`. These should correspond to samples in which nucleotide conversions were not intentionally induced. They serve as controls for SNP identification (see below). They may be a subset of the samples provided to `--samplenames`.
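If you prefer to lay out these directories yourself rather than letting `alignAndQuant` create them, the expected skeleton can be built in a few lines of Python (a minimal sketch; `workingdir` and the sample names are placeholders, and the helper name is hypothetical):

```python
import os

def make_pigpen_dirs(workingdir, samplenames):
    # Create the STAR/salmon/postmaster subdirectories that pigpen expects
    # for each sample, mirroring the tree shown above.
    for sample in samplenames:
        for tool in ("STAR", "salmon", "postmaster"):
            os.makedirs(os.path.join(workingdir, sample, tool), exist_ok=True)

make_pigpen_dirs("workingdir", ["sample1", "sample2"])
```

The alignment, quantification, and postmaster output files for each sample then go into the corresponding subdirectories.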
+ +## SNPs + +8-OG-induced conversions are rare, and this rarity makes it imperative that contributions from conversions that are not due to oxidation are minimized. A major source of apparent conversions is SNPs. It is therefore advantageous to find and mask SNPs in the data. + +PIGPEN performs this by using [varscan](http://varscan.sourceforge.net/using-varscan.html) to find SNP positions. These locations are then excluded from all future analyses. Varscan parameters are set with the PIGPEN parameters `--SNPcoverage` and `--SNPfreq`, which control the depth and frequency required to call a SNP. We recommend being aggressive with these parameters. We often set them to 20 and 0.2, respectively. + +PIGPEN performs this SNP calling on control samples (`--controlsamples`) in which the intended oxidation did not occur. PIGPEN will use the union of all SNPs found in these files for masking. Whether or not to call SNPs at all (you probably should) is controlled by `--useSNPs`. + + +## GFFtype + +PIGPEN uses genome annotations to relate transcripts and genes. There are peculiarities to these annotations based on where they came from. Generally, PIGPEN prefers annotation files from either [GENCODE](www.gencodegenes.org) or [Ensembl](https://www.ensembl.org/index.html). Tell PIGPEN where your annotation came from using the `--gfftype` flag. + +## Quantifying conversions + +PIGPEN then identifies conversions in reads. This can be done using multiple processors (`--nproc`). In order to minimize the effect of sequencing error, PIGPEN only considers positions for which the sequencing quality was at least 30. There are two important flags to consider here. + +First, `--onlyConsiderOverlap` requires that the same conversion be observed in both reads of a mate pair. Positions interrogated by only one read are not considered. This can improve accuracy: true oxidation-induced conversions are rare enough that sequencing errors can cause a problem.
Requiring that a conversion be present in both reads minimizes the effect of sequencing errors. If the fragment sizes for a library are especially large relative to the read length, the number of positions interrogated by both mates will be small. + +Second, `nConv` sets the minimum number of G -> C / G -> T conversions in a read pair in order for those conversions to be recorded. The rationale here is again to reduce the contribution of background, non-oxidation-related conversions. Background conversions should be distributed relatively randomly across reads. However, due to the spatial nature of the oxidation reaction, oxidation-induced conversions should be more clustered into specific reads. Therefore, requiring at least two conversions can increase specificity. In practice, this works well if the data is very deep or concentrated on a small number of targets. When dealing with transcriptome-scale data, this flag often reduces the number of observed conversions to an unacceptably low level. + +## Assigning reads to genes + +After PIGPEN calculates the number of converted and nonconverted nucleotides in each read pair, it intersects that data with the probabilistic transcript assignment for each read performed by `salmon` and `postmaster`. Conversions within read pair X are assigned proportionally to transcript Y according to the `salmon`/`postmaster`-calculated probability that read pair X originated from transcript Y. This transcript-level data is then collapsed to gene-level data according to the transcript/gene relationships found in `--gff`. Transcript IDs in `--gff` should match those in the fasta file used to make `--salmonindex`. The use of [GENCODE](www.gencodegenes.org) annotations is recommended if possible. Alternatively, Ensembl annotations can be used. The source of the annotations should be supplied using the `--gfftype` flag.
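The proportional assignment just described can be illustrated with a small sketch. This is not PIGPEN's actual implementation; the function, read, transcript, and gene names are all hypothetical:

```python
from collections import defaultdict

def assign_conversions(read_convs, read_probs, tx2gene):
    """Distribute each read pair's conversion count across transcripts in
    proportion to its posterior assignment probabilities, then collapse the
    transcript-level counts to gene level via the transcript/gene map."""
    gene_convs = defaultdict(float)
    for read, nconv in read_convs.items():
        for tx, prob in read_probs[read].items():
            gene_convs[tx2gene[tx]] += nconv * prob
    return dict(gene_convs)

# Hypothetical example: read pair r1 carries 2 relevant conversions and is
# assigned 0.75/0.25 to two transcripts of the same gene.
counts = assign_conversions(
    {"r1": 2},
    {"r1": {"txA": 0.75, "txB": 0.25}},
    {"txA": "geneX", "txB": "geneX"},
)
# counts == {"geneX": 2.0}
```

Because both transcripts belong to the same gene, the fractional assignments sum back to the read pair's full conversion count at the gene level.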
+ +## Calculating the number of conversions per gene + +We have observed that the overall rate of conversions (not just G -> T + G -> C, but all conversions) can vary significantly from sample to sample, presumably due to a technical effect in library preparation. For this reason, PIGPEN calculates **PORC** (Proportion of Relevant Conversions) values. This is the log2 ratio of the relevant conversion rate ([G -> T + G -> C] / total number of reference G encountered) to the overall conversion rate (total number of all conversions / total number of positions interrogated). PORC therefore normalizes to the overall rate of conversions, removing this technical effect. + +PIGPEN can use G -> T conversions, G -> C conversions, G deletions, or any combination when calculating PORC values. This behavior is controlled by supplying some or all of the options `--use_g_t`, `--use_g_c`, and `--use_g_x`, respectively. + +## Using one read of a paired end sample + +The use of one read in a paired end sample for conversion quantification can be controlled using `--use_read1` and `--use_read2`. To use both reads, supply both flags. `--onlyConsiderOverlap` requires the use of both reads. Importantly, both reads can still be used for genomic alignment and transcript quantification. + +## Mask specific positions + +To prevent specific genomic locations from being considered during conversion quantification, supply a bed file of these locations to `--maskbed`. + +## Output + +Output files are named `<samplename>.pigpen.txt`. These files contain the number of observed conversions for each gene as well as derived values like conversion rates and PORC values. + +## Statistical framework for comparing gene-level PORC values across conditions + +We could simply compare PORC values across conditions, but with that approach we lose information about the number of counts (conversions) that went into the PORC calculation.
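For reference, the PORC definition given above reduces to a one-line formula. This is a minimal sketch with made-up counts; `porc` here is a hypothetical helper, not part of PIGPEN:

```python
import math

def porc(g_t, g_c, ref_g, all_convs, positions):
    """log2 ratio of the relevant conversion rate (G->T + G->C over all
    reference G's encountered) to the overall conversion rate (all
    conversions over all positions interrogated)."""
    relevant_rate = (g_t + g_c) / ref_g
    overall_rate = all_convs / positions
    return math.log2(relevant_rate / overall_rate)

# Hypothetical counts: 30 G->T + 10 G->C among 10,000 reference G's,
# against 200 total conversions over 100,000 interrogated positions.
print(round(porc(30, 10, 10_000, 200, 100_000), 3))  # -> 1.0
```

A PORC of 0 means the relevant conversions occur at exactly the background rate; positive values indicate enrichment of G -> T / G -> C conversions above background.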
+ +For each gene, PIGPEN calculates the number of relevant conversions (G -> T + G -> C) as well as all other conversions encountered. Each gene therefore ends up with a 2x2 contingency table of the following form: + +|| converted | not converted | +| ----------------|---------------|-------- | +| G -> C or G -> T | a | b | +| other conversion | c | d | + +We then want to compare groups (replicates) of contingency tables across conditions. BACON (Bioinformatic Analysis of the Conversion of Nucleotides) performs this comparison using a binomial linear mixed-effects model. Replicates are modeled as random effects. + +`full model = conversions ~ nucleotide + condition + nucleotide:condition + (1 | replicate)` + +`null model = conversions ~ nucleotide + condition + (1 | replicate)` + +The two models are then compared using a likelihood ratio test. + +As input, BACON takes a tab-delimited, headered file of the following form with one row per sample: + +| file | sample | condition | +| -----|--------|-----------| +| /path/to/pigpen/output | sample name | condition ID| diff --git a/bacon.py b/bacon.py deleted file mode 100644 index c394df2..0000000 --- a/bacon.py +++ /dev/null @@ -1,250 +0,0 @@ -#Bioinformatic Analysis of the Conversion Of Nucleotides (BACON) - -#Identify genes whose PORC values are different across conditions -#Two possibilities: GLM (readconditions, makePORCdf, getLMEp) -#or -#subsample reads, calculate many PORC values, using Hotelling T2 to identify genes with different PORC distributions across conditions - -import pandas as pd -import sys -import numpy as np -from collections import OrderedDict -import math -import statsmodels.api as sm -import statsmodels.formula.api as smf -from statsmodels.stats.multitest import multipletests -from scipy.stats.distributions import chi2 -import warnings -import time - - -def readconditions(samp_conds_file): - #Read in a three column file of oincoutputfile / sample / condition - #Return a pandas df - sampconddf = 
pd.read_csv(samp_conds_file, sep = '\t', index_col = False, header = 0) - - return sampconddf - -def makePORCdf(samp_conds_file, minreads): - #Make a dataframe of PORC values for all samples - #start with GENE...SAMPLE...READCOUNT...PORC - #then make a wide version that is GENE...SAMPLE1READCOUNT...SAMPLE1PORC...SAMPLE2READCOUNT...SAMPLE2PORC - - #Only keep genes that are present in EVERY pigpen file - - minreads = int(minreads) - dfs = [] #list of dfs from individual pigpen outputs - genesinall = set() #genes present in every pigpen file - with open(samp_conds_file, 'r') as infh: - for line in infh: - line = line.strip().split('\t') - if line[0] == 'file': - continue - pigpenfile = line[0] - sample = line[1] - df = pd.read_csv(pigpenfile, sep = '\t', index_col = False, header = 0) - dfgenes = df['Gene'].tolist() - samplecolumn = [sample] * len(dfgenes) - df = df.assign(sample = samplecolumn) - - if not genesinall: #if there are no genes in there (this is the first file) - genesinall = set(dfgenes) - else: - genesinall = genesinall.intersection(set(dfgenes)) - - columnstokeep = ['Gene', 'sample', 'numreads', 'porc'] - df = df[columnstokeep] - dfs.append(df) - - #for each df, filter keeping only the genes present in every df (genesinall) - #Somehow there are some genes whose name in NA - genesinall.remove(np.nan) - dfs = [df.loc[df['Gene'].isin(genesinall)] for df in dfs] - - #concatenate (rbind) dfs together - df = pd.concat(dfs) - - #turn from long into wide - df = df.pivot_table(index = 'Gene', columns = 'sample', values = ['numreads', 'porc']).reset_index() - #flatten multiindex column names - df.columns = ["_".join(a) if '' not in a else a[0] for a in df.columns.to_flat_index()] - - #Filter for genes with at least minreads in every sample - #get columns with numreads info - numreadsColumns = [col for col in df.columns if 'numreads' in col] - #Get minimum in those columns - df['minreadcount'] = df[numreadsColumns].min(axis = 1) - #Filter for rows with 
minreadcount >= minreads - print('Filtering for genes with at least {0} reads in every sample.'.format(minreads)) - df = df.loc[df['minreadcount'] >= minreads] - print('{0} genes have at least {1} reads in every sample.'.format(len(df), minreads)) - #We also don't want rows with inf/-inf PORC values - df = df.replace([np.inf, -np.inf], np.nan) - df = df.dropna(how= 'any') - #Return a dataframe of just genes and PORC values - columnstokeep = ['Gene'] + [col for col in df.columns if 'porc' in col] - df = df[columnstokeep] - - return df - -def getLMEp(sampconddf, porcdf, conditionA, conditionB): - #Calculate pvalues for genes based on PORC values across conditions using LME model - #Delta porc values are calculated as conditionB - condition A - - deltaporcdict = OrderedDict() #{genename : deltaporc} ordered so it's easy to match it up with porcdf - pvaluedict = OrderedDict() #{genename : pvalue} ordered so it's easy to match it up with q values - - #Get column names in porcdf that are associated with each condition - conditionAsamps = sampconddf.loc[sampconddf['condition'] == conditionA] - conditionAsamps = conditionAsamps['sample'].tolist() - conditionAcolumns = ['porc_' + samp for samp in conditionAsamps] - - conditionBsamps = sampconddf.loc[sampconddf['condition'] == conditionB] - conditionBsamps = conditionBsamps['sample'].tolist() - conditionBcolumns = ['porc_' + samp for samp in conditionBsamps] - - print('Condition A samples: ' + (', ').join(conditionAsamps)) - print('Condition B samples: ' + (', ').join(conditionBsamps)) - - #Store relationships of conditions and the samples in that condition - #It's important that this dictionary be ordered because we are going to be iterating through it - samp_conds = OrderedDict({'condA' : conditionAcolumns, 'condB' : conditionBcolumns}) - - #Get a list of all samples - samps = [] - for cond in samp_conds: - samps += samp_conds[cond] - - #Iterate through rows, making a dictionary from every row, turning it into a dataframe, 
then calculating p value - genecounter = 0 - for index, row in porcdf.iterrows(): - genecounter +=1 - if genecounter % 1000 == 0: - print('Calculating pvalue for gene {0}...'.format(genecounter)) - - d = {} - d['Gene'] = [row['Gene']] * len(samps) - d['variable'] = samps - - values = [] #porc values - for cond in samp_conds: - for sample in samp_conds[cond]: - value = row[sample] - values.append(value) - d['value'] = values - - #If there is an NA psi value, we are not going to calculate a pvalue for this gene - p = None - if True in np.isnan(values): - p = np.nan - - conds = [] - for cond in samp_conds: - conds += [cond] * len(samp_conds[cond]) - condAs = [] - condBs = [] - for cond in conds: - if cond == 'condA': - condAs.append(1) - condBs.append(0) - elif cond == 'condB': - condAs.append(0) - condBs.append(1) - d['condA'] = condAs #e.g. [1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0] - d['condB'] = condBs #e.g. [0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1] - - d['samples'] = [x + 1 for x in range(len(samps))] - - #Turn this dictionary into a DataFrame - rowdf = pd.DataFrame.from_dict(d) - - #delta psi is difference between mean psi of two conditions (cond2 - cond1) - condAmeanporc = float(format(np.mean(rowdf.query('condA == 1').value.dropna()), '.3f')) - condBmeanporc = float(format(np.mean(rowdf.query('condB == 1').value.dropna()), '.3f')) - deltaporc = condBmeanporc - condAmeanporc - deltaporc = float(format(deltaporc, '.3f')) - deltaporcdict[row['Gene']] = deltaporc - - #Get LME pvalue, but only if we haven't already determined that the pvalue is NA because we are missing one or more psi values - #Lots of warnings about convergence, etc. Suppress them. - if not p: - with warnings.catch_warnings(): - warnings.filterwarnings('ignore') - - #So apparently, some combinations of psi values will give nan p values due to a LinAlgError that arises from a singular - #hessian matrix during the fit of the model. 
However, the equivalent code in R (nlme::lme) never gives this error, even with - #the same data. It's not clear from just looking at the psi values why this is. However, I found that by varying the - #start_params in the fit, this can be avoided. If this is done, the resulting p value always matches what is given in R. - #Further, the p value is the same regardless of the start_param. - #But it's not clear to me why changing the start_param matters, or what the default is here or with nlme. - #So let's try a few starting paramters. Regardless, this seems to affect a small number of genes (<1%), and it is causing - #false negatives because genes that should get p values (may or may not be sig) are getting NA. - possible_start_params = [0, 0, 1, -1, 2, -2] - numberoftries = -1 - for param in possible_start_params: - #if we already have a pvalue, don't try again - if p != None and not np.isnan(p): - break - #First time through, numberoftries = 0, and we are just using a placeholder startparam (0) here because we aren't even using it. 
- #Gonna use whatever the default is - numberoftries += 1 - try: - #actual model - md = smf.mixedlm('value ~ condA', data=rowdf, groups='samples', missing='drop') - if numberoftries == 0: - # REML needs to be false in order to use log-likelihood for pvalue calculation - mdf = md.fit(reml=False) - elif numberoftries > 0: - mdf = md.fit(reml=False, start_params=[param]) - - #null model - nullmd = smf.mixedlm('value ~ 1', data=rowdf, groups='samples', missing='drop') - if numberoftries == 0: - nullmdf = nullmd.fit(reml=False) - elif numberoftries > 0: - nullmdf = nullmd.fit(reml=False, start_params=[param]) - - #Likelihood ratio - LR = 2 * (mdf.llf - nullmdf.llf) - p = chi2.sf(LR, df=1) - - #These exceptions are needed to catch cases where either all psi values are nan (valueerror) or all psi values for one condition are nan (linalgerror) - except (ValueError, np.linalg.LinAlgError): - p = np.nan - - pvaluedict[row['Gene']] = float('{:.2e}'.format(p)) - - #Correct pvalues using BH method, but only using pvalues that are not NA - pvalues = list(pvaluedict.values()) - pvaluesformultitest = [pval for pval in pvalues if str(pval) != 'nan'] - fdrs = list(multipletests(pvaluesformultitest, method='fdr_bh')[1]) - fdrs = [float('{:.2e}'.format(fdr)) for fdr in fdrs] - - #Now we need to incorporate the places where p = NA into the list of FDRs (also as NA) - fdrswithnas = [] - fdrindex = 0 - for pvalue in pvalues: - if str(pvalue) != 'nan': - fdrswithnas.append(fdrs[fdrindex]) - fdrindex += 1 - elif str(pvalue) == 'nan': - fdrswithnas.append(np.nan) - - #Add deltaporcs, pvalues, and FDRs to df - deltaporcs = list(deltaporcdict.values()) - porcdf = porcdf.assign(deltaporc=deltaporcs) - porcdf = porcdf.assign(pval=pvalues) - porcdf = porcdf.assign(FDR=fdrswithnas) - - #Write df - fn = 'porc.pval.txt' - porcdf.to_csv(fn, sep='\t', index=False, float_format='%.3g', na_rep='NA') - - - - -if __name__ == '__main__': - sampconddf = readconditions(sys.argv[1]) - #print(sampconddf) - 
porcdf = makePORCdf(sys.argv[1], sys.argv[2]) - getLMEp(sampconddf, porcdf, sys.argv[3], sys.argv[4]) diff --git a/cutadapt.sh b/cutadapt.sh new file mode 100644 index 0000000..5194421 --- /dev/null +++ b/cutadapt.sh @@ -0,0 +1,65 @@ +#This file describes a method for trimming adapters from reads produced by libraries made using Quantseq 3' mRNA-seq V2 Library Prep Kit FWD. +#This kit is the recommended one to use with transcriptome-wide OINC-seq experiments. +#It requires that cutadapt (https://cutadapt.readthedocs.io/) be in the user's path. + +#Quantseq samples +#Read1, trim AAAAAAAAAAAAAAAAAAAA from 3' end +#Read2, trim AGATCGGAAGAGCGTCGTGTAGGGAAAGACGGTA from 3' end and TTTTTTTTTTTTTTTTTTTT from 5' end + +#QUANTSEQ STRATEGY +#Step1: get rid of UMI (first 6 nt of read 1) and trim 3' adapter off read 1 (-u 6 ; -a AAAAAAAAAAAAAAAAAAAA) [make temporary output files] +#Step2: Trim 5' adapter of read 2 (TTTTTTTTTTTTTTTTTTTT) [make temporary output files] +#Step3: Try to trim 3' adapter of read 2 (AGATCGGAAGAGCGTCGTGTAGGGAAAGACGGTA). Write untrimmed reads (these are done). [make temporary outfile for trimmed reads] +#Step4: For reads that did have 3' adapter on read 2, remove the last 6 bases on read 2 (UMI). These are now done too. +#Step5: Combine trimmed reads from step4 with untrimmed reads from step3. + +readdir='' #Directory containing all fastq read files, eg file1_1.fq.gz, file1_2.fq.gz, file2_1.fq.gz, etc. +readfiles=( 'file1' 'file2' 'file3' ) #list of file names +outnames=( 'sample1' 'sample2' 'sample3' ) #list of final file names for trimmed reads. In this example, after trimming, file1_1.fq.gz will be named as sample1_1.fq.gz. 
+ + +for i in ${!readfiles[@]} +do + read1=${readdir}${readfiles[$i]}_1.fq.gz + read2=${readdir}${readfiles[$i]}_2.fq.gz + outread1s1=${readdir}${outnames[$i]}_1.temp.step1.fq.gz + outread2s1=${readdir}${outnames[$i]}_2.temp.step1.fq.gz + statsouts1=${readdir}${outnames[$i]}.cutadaptstats.step1.txt + + outread1s2=${readdir}${outnames[$i]}_1.temp.step2.fq.gz + outread2s2=${readdir}${outnames[$i]}_2.temp.step2.fq.gz + statsouts2=${readdir}${outnames[$i]}.cutadaptstats.step2.txt + + outread1s3=${readdir}${outnames[$i]}_1.temp.step3.fq.gz + outread2s3=${readdir}${outnames[$i]}_2.temp.step3.fq.gz + statsouts3=${readdir}${outnames[$i]}.cutadaptstats.step3.txt + untrimmedr1s3=${readdir}${outnames[$i]}_1.untrimmed.step3.fq.gz + untrimmedr2s3=${readdir}${outnames[$i]}_2.untrimmed.step3.fq.gz + + outread1s4=${readdir}${outnames[$i]}_1.temp.step4.fq.gz + outread2s4=${readdir}${outnames[$i]}_2.temp.step4.fq.gz + statsouts4=${readdir}${outnames[$i]}.cutadaptstats.step4.txt + + finaloutread1=${readdir}${outnames[$i]}_1.trimmed.fq.gz + finaloutread2=${readdir}${outnames[$i]}_2.trimmed.fq.gz + + #Step1 + cutadapt -u 6 -U 0 -a AAAAAAAAAAAAAAAAAAAA --minimum-length 25 -j 8 -o ${outread1s1} -p ${outread2s1} ${read1} ${read2} > ${statsouts1} + + #Step2 + cutadapt -G TTTTTTTTTTTTTTTTTTTT --minimum-length 25 -j 8 -o ${outread1s2} -p ${outread2s2} ${outread1s1} ${outread2s1} > ${statsouts2} + + #Step3 + cutadapt -A AGATCGGAAGAGCGTCGTGTAGGGAAAGACGGTA --minimum-length 25 --untrimmed-output ${untrimmedr1s3} --untrimmed-paired-output ${untrimmedr2s3} -o ${outread1s3} -p ${outread2s3} ${outread1s2} ${outread2s2} > ${statsouts3} + + #Step4 + cutadapt -U -6 --minimum-length 25 -j 8 -o ${outread1s4} -p ${outread2s4} ${outread1s3} ${outread2s3} > ${statsouts4} + + #Step5 + cat ${untrimmedr1s3} ${outread1s4} > ${finaloutread1} + cat ${untrimmedr2s3} ${outread2s4} > ${finaloutread2} + + #Clean up intermediates where they were written (${readdir}), not the current directory + rm ${readdir}*temp*fq.gz + rm ${readdir}*untrimmed* + +done \ No newline at end of file diff --git a/getmismatches.py
b/getmismatches.py deleted file mode 100644 index a792275..0000000 --- a/getmismatches.py +++ /dev/null @@ -1,594 +0,0 @@ -import pysam -import os -import re -import sys -from collections import defaultdict -import numpy as np -import pandas as pd -from snps import getSNPs, recordSNPs -import pickle -import multiprocessing as mp -import subprocess - - -def revcomp(nt): - revcompdict = { - 'G' : 'C', - 'C' : 'G', - 'A' : 'T', - 'T' : 'A', - 'N' : 'N', - 'g' : 'c', - 'c' : 'g', - 'a' : 't', - 't' : 'a', - 'n' : 'n' - } - - nt_rc = revcompdict[nt] - - return nt_rc - - -def iteratereads_singleend(bam, snps = None): - #Read through a bam containing single end reads (or if it contains paired end reads, just use read 1) - #Find nt conversion locations for each read. - #Store the number of each conversion for each read in a dictionary. - - #Bam must contain MD tag. - - readcounter = 0 - convs = {} #{readid : dictionary of all conversions} - with pysam.AlignmentFile(bam, 'r') as infh: - print('Finding nucleotide conversions in {0}...'.format(os.path.basename(bam))) - for read in infh.fetch(until_eof = True): - if read.is_secondary or read.is_supplementary or read.is_unmapped: - continue - if read.is_read1: - readcounter +=1 - if readcounter % 10000000 == 0: - print('Finding nucleotide conversions in read {0}...'.format(readcounter)) - - #Check mapping quality - #For nextgenmap, max mapq is 60 - if read.mapping_quality < 50: - continue - - queryname = read.query_name - queryseq = read.query_sequence #this is always on the + strand, no matter what strand the read maps to - chrom = read.reference_name - qualities = list(read.query_qualities) - - #Get a set of snp locations if we have them - if snps: - if chrm in snps: - snplocations = snps[chrm] #set of coordinates to mask - else: - snplocations = None - else: - snplocations = None - - if read.is_reverse: - strand = '-' - elif not read.is_reverse: - strand = '+' - - alignedpairs = read.get_aligned_pairs(with_seq = True) - 
readqualities = list(read.query_qualities) - convs_in_read = getmismatches_singleend(alignedpairs, queryseq, readqualities, strand, chrom, snplocations) - - convs[queryname] = convs_in_read - - - #Pickle and write convs? - return convs - - -def getmismatches_singleend(alignedpairs, queryseq, readqualities, strand, chrom, snplocations): - #remove tuples that have None - #These are either intronic or might have been soft-clipped - #Tuples are (querypos, (0-based) refpos, refsequence) - #If there is a substitution, refsequence is lower case - - #refnt and querynt, as supplied by pysam, are always + strand - #everything here is assuming read is on sense strand - - #snplocations is a set of chrm_coord locations. At these locations, all queries will be treated as not having a conversion - - alignedpairs = [x for x in alignedpairs if None not in x] - - #if we have snps, remove their locations from alignedpairs - #snplocations is a set of 0-based coordinates of snp locations to mask - if snplocations: - alignedpairs = [x for x in alignedpairs if x[1] not in snplocations] - - convs = {} #counts of conversions x_y where x is reference sequence and y is query sequence - convlocations = defaultdict(list) #{type of conv : [locations of conversion]} - - possibleconvs = [ - 'a_a', 'a_t', 'a_c', 'a_g', 'a_n', - 'g_a', 'g_t', 'g_c', 'g_g', 'g_n', - 'c_a', 'c_t', 'c_c', 'c_g', 'c_n', - 't_a', 't_t', 't_c', 't_g', 't_n'] - - #initialize dictionary - for conv in possibleconvs: - convs[conv] = 0 - - for alignedpair in alignedpairs: - refnt = alignedpair[2] - - #if reference is N, skip this position - if refnt == 'N' or refnt == 'n': - continue - - if strand == '-': - refnt = revcomp(refnt) - - #Not a conversion - if refnt.isupper(): - conv = refnt.lower() + '_' + refnt.lower() - #Is a conversion - elif refnt.islower(): - querynt = queryseq[alignedpair[0]] - if strand == '-': - querynt = revcomp(querynt) - conv = refnt.lower() + '_' + querynt.lower() - - #If the quality at this position 
passes threshold, record the conversion. - #Otherwise, skip it. - qscore = readqualities[alignedpair[0]] - if qscore >= 30: #can change this later - convs[conv] +=1 - else: - pass - - return convs - -def read_pair_generator(bam, region_string=None): - """ - Generate read pairs in a BAM file or within a region string. - Reads are added to read_dict until a pair is found. - https://www.biostars.org/p/306041/ - """ - read_dict = defaultdict(lambda: [None, None]) - for read in bam: - if not read.is_proper_pair or read.is_secondary or read.is_supplementary or read.mate_is_unmapped or read.is_unmapped: - continue - qname = read.query_name - if qname not in read_dict: - if read.is_read1: - read_dict[qname][0] = read - else: - read_dict[qname][1] = read - else: - if read.is_read1: - yield read, read_dict[qname][1] - else: - yield read_dict[qname][0], read - del read_dict[qname] - -def findsnps(controlbams, genomefasta, minCoverage = 20, minVarFreq = 0.02): - controlbams = controlbams.split(',') - getSNPs(controlbams, genomefasta, minCoverage, minVarFreq) - snps = recordSNPs('merged.vcf') - - return snps - -def iteratereads_pairedend(bam, onlyConsiderOverlap, snps = None, requireMultipleConv = False, verbosity = 'high'): - #Iterate over reads in a paired end alignment file. - #Find nt conversion locations for each read. 
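The pairing trick in `read_pair_generator()` above can be demonstrated with lightweight stand-in objects instead of pysam reads (the proper-pair/secondary/unmapped filtering is omitted here for brevity; `Read` is a hypothetical stand-in type):

```python
#Toy demonstration of the read-pairing logic in read_pair_generator():
#mates are cached by query name until the partner arrives, then the pair is
#yielded as (read1, read2) and evicted from the cache.
from collections import defaultdict, namedtuple

Read = namedtuple('Read', ['query_name', 'is_read1'])

def pair_reads(reads):
    read_dict = defaultdict(lambda: [None, None])
    for read in reads:
        qname = read.query_name
        if qname not in read_dict:
            if read.is_read1:
                read_dict[qname][0] = read
            else:
                read_dict[qname][1] = read
        else:
            if read.is_read1:
                yield read, read_dict[qname][1]
            else:
                yield read_dict[qname][0], read
            del read_dict[qname]

#Mates arrive interleaved across query names, as in a sorted bam
reads = [Read('q1', True), Read('q2', False), Read('q2', True), Read('q1', False)]
pairs = list(pair_reads(reads))
#Pairs complete in arrival order of the second mate: q2, then q1
```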
- #For locations interrogated by both mates of read pair, conversion must exist in both mates in order to count - #Store the number of conversions for each read in a dictionary - - #Quality score array is always in the same order as query_sequence, which is always on the + strand - #Bam must contain MD tags - - if onlyConsiderOverlap == 'True': - onlyConsiderOverlap = True - elif onlyConsiderOverlap == 'False': - onlyConsiderOverlap = False - - queriednts = [] - readcounter = 0 - convs = {} #{readid : dictionary of all conversions} - save = pysam.set_verbosity(0) - with pysam.AlignmentFile(bam, 'r') as infh: - if verbosity == 'high': - print('Finding nucleotide conversions in {0}...'.format(os.path.basename(bam))) - for read1, read2 in read_pair_generator(infh): - - #Just double check that the pairs are matched - if read1.query_name != read2.query_name: - continue - - #Check mapping quality - #MapQ is 255 for uniquely aligned reads FOR STAR ONLY - if read1.mapping_quality < 255 or read2.mapping_quality < 255: - continue - - readcounter +=1 - if readcounter % 1000000 == 0: - if verbosity == 'high': - print('Finding nucleotide conversions in read {0}...'.format(readcounter)) - - queryname = read1.query_name - chrm = read1.reference_name - - #Get a set of snp locations if we have them - if snps: - if chrm in snps: - snplocations = snps[chrm] #set of coordinates to mask - else: - snplocations = None - else: - snplocations = None - - read1queryseq = read1.query_sequence - read1alignedpairs = read1.get_aligned_pairs(with_seq = True) - if read1.is_reverse: - read1strand = '-' - elif not read1.is_reverse: - read1strand = '+' - - read2queryseq = read2.query_sequence - read2alignedpairs = read2.get_aligned_pairs(with_seq = True) - if read2.is_reverse: - read2strand = '-' - elif not read2.is_reverse: - read2strand = '+' - - read1qualities = list(read1.query_qualities) #phred scores - read2qualities = list(read2.query_qualities) - - convs_in_read = 
getmismatches_pairedend(read1alignedpairs, read2alignedpairs, read1queryseq, read2queryseq, read1qualities, read2qualities, read1strand, read2strand, snplocations, onlyConsiderOverlap, requireMultipleConv) - queriednts.append(sum(convs_in_read.values())) - convs[queryname] = convs_in_read - - if verbosity == 'high': - print('Queried {0} read pairs in {1}.'.format(readcounter, os.path.basename(bam))) - - pysam.set_verbosity(save) - #Pickle and write convs? - return convs, readcounter - - -def getmismatches_pairedend(read1alignedpairs, read2alignedpairs, read1queryseq, read2queryseq, read1qualities, read2qualities, read1strand, read2strand, snplocations, onlyoverlap, requireMultipleConv): - #remove tuples that have None - #These are either intronic or might have been soft-clipped - #Tuples are (querypos, refpos, refsequence) - #If there is a substitution, refsequence is lower case - - #In quantseq-fwd, r1 is always sense strand - - read1alignedpairs = [x for x in read1alignedpairs if None not in x] - read2alignedpairs = [x for x in read2alignedpairs if None not in x] - - #if we have snps, remove their locations from read1alignedpairs and read2alignedpairs - #snplocations is a set of 0-based coordinates of snp locations to mask - if snplocations: - read1alignedpairs = [x for x in read1alignedpairs if x[1] not in snplocations] - read2alignedpairs = [x for x in read2alignedpairs if x[1] not in snplocations] - - convs = {} #counts of conversions x_y where x is reference sequence and y is query sequence - - possibleconvs = [ - 'a_a', 'a_t', 'a_c', 'a_g', 'a_n', - 'g_a', 'g_t', 'g_c', 'g_g', 'g_n', - 'c_a', 'c_t', 'c_c', 'c_g', 'c_n', - 't_a', 't_t', 't_c', 't_g', 't_n'] - - #initialize dictionary - for conv in possibleconvs: - convs[conv] = 0 - - #For locations interrogated by both mates of read pair, conversion must exist in both mates in order to count - #These locations (as defined by their reference positions) would be found both in read1alignedpairs and 
read2alignedpairs - - #Get the ref positions queried by the two reads - r1dict = {} #{reference position : [queryposition, reference sequence]} - r2dict = {} - for x in read1alignedpairs: - r1dict[int(x[1])] = [x[0], x[2]] - for x in read2alignedpairs: - r2dict[int(x[1])] = [x[0], x[2]] - - mergedalignedpairs = {} # {refpos : [R1querypos, R2querypos, R1refsequence, R2refsequence]} - #For positions only in R1 or R2, querypos and refsequence are NA for the other read - for refpos in r1dict: - r1querypos = r1dict[refpos][0] - r1refseq = r1dict[refpos][1] - if refpos in mergedalignedpairs: #this should not be possible because we are looking at r1 first - r2querypos = mergedalignedpairs[refpos][1] - r2refseq = mergedalignedpairs[refpos][3] - mergedalignedpairs[refpos] = [r1querypos, r2querypos, r1refseq, r2refseq] - else: - mergedalignedpairs[refpos] = [r1querypos, 'NA', r1refseq, 'NA'] - - for refpos in r2dict: - #same thing - r2querypos = r2dict[refpos][0] - r2refseq = r2dict[refpos][1] - if refpos in mergedalignedpairs: #if we saw it for r1 - r1querypos = mergedalignedpairs[refpos][0] - r1refseq = mergedalignedpairs[refpos][2] - mergedalignedpairs[refpos] = [r1querypos, r2querypos, r1refseq, r2refseq] - else: - mergedalignedpairs[refpos] = ['NA', r2querypos, 'NA', r2refseq] - - - #Now go through mergedalignedpairs, looking for conversions. - #For positions observed both in r1 and r2, a conversion must be present in both reads, - #otherwise it will be recorded as not having a conversion. - - for refpos in mergedalignedpairs: - r1querypos = mergedalignedpairs[refpos][0] - r2querypos = mergedalignedpairs[refpos][1] - r1refseq = mergedalignedpairs[refpos][2] - r2refseq = mergedalignedpairs[refpos][3] - - if r1querypos != 'NA' and r2querypos == 'NA': #this position queried by r1 only - if read1strand == '-': - #refseq needs to equal the sense strand (it is always initially defined as the + strand). read1 is always the sense strand. 
- r1refseq = revcomp(r1refseq) - - #If reference is N, skip this position - if r1refseq == 'N' or r1refseq == 'n': - continue - - if r1refseq.isupper(): #not a conversion - conv = r1refseq.lower() + '_' + r1refseq.lower() - elif r1refseq.islower(): #is a conversion - querynt = read1queryseq[r1querypos] - if read1strand == '-': - querynt = revcomp(querynt) - conv = r1refseq.lower() + '_' + querynt.lower() - - if read1qualities[r1querypos] >= 30 and onlyoverlap == False: - convs[conv] +=1 - else: - pass - - elif r1querypos == 'NA' and r2querypos != 'NA': #this position is queried by r2 only - if read1strand == '-': - r2refseq = revcomp(r2refseq) #reference seq is independent of which read we are talking about - - if r2refseq == 'N' or r2refseq == 'n': - continue - - if r2refseq.isupper(): #not a conversion - conv = r2refseq.lower() + '_' + r2refseq.lower() - elif r2refseq.islower(): #is a conversion - querynt = read2queryseq[r2querypos] - if read1strand == '-': - #Read1 is always the sense strand. r1queryseq and r2queryseq are always + strand - #The reference sequence is always on the + strand. - #If read1 is on the - strand, we have already flipped reference seq (see a few lines above). - #We need to flip read2queryseq so that it is also - strand. 
- querynt = revcomp(querynt) - conv = r2refseq.lower() + '_' + querynt.lower() - - if read2qualities[r2querypos] >= 30 and onlyoverlap == False: - convs[conv] +=1 - else: - pass - - elif r1querypos != 'NA' and r2querypos != 'NA': #this position is queried by both reads - if read1strand == '-': - r1refseq = revcomp(r1refseq) - r2refseq = revcomp(r2refseq) - - if r1refseq == 'N' or r2refseq == 'N' or r1refseq == 'n' or r2refseq == 'n': - continue - - #If the position is not high quality in either r1 or r2, skip it - if read1qualities[r1querypos] < 30 and read2qualities[r2querypos] < 30: - continue - - if r1refseq.isupper() and r2refseq.isupper(): #both reads agree it is not a conversion - conv = r1refseq.lower() + '_' + r1refseq.lower() - convs[conv] += 1 - - elif r1refseq.isupper() and r2refseq.islower(): #r1 says no conversion, r2 says conversion, so we say no conversion - conv = r1refseq.lower() + '_' + r1refseq.lower() - convs[conv] += 1 - - elif r1refseq.islower() and r2refseq.isupper(): #r1 says conversion, r2 says no conversion, so we say no conversion - conv = r2refseq.lower() + '_' + r2refseq.lower() - convs[conv] += 1 - - elif r1refseq.islower() and r2refseq.islower(): #both reads say there was a conversion - r1querynt = read1queryseq[r1querypos] - r2querynt = read2queryseq[r2querypos] - if read1strand == '-': - r1querynt = revcomp(r1querynt) - r2querynt = revcomp(r2querynt) - - #If the query nts don't match, skip this position - if r1querynt == r2querynt: - conv = r1refseq.lower() + '_' + r1querynt.lower() - convs[conv] +=1 - else: - pass - - #if we are requiring there be multiple g_t or g_c, implement that here - if requireMultipleConv: - if convs['g_t'] + convs['g_c'] >= 2: - pass - elif convs['g_t'] + convs['g_c'] < 2: - convs['g_t'] = 0 - convs['g_c'] = 0 - - return convs - - -def summarize_convs(convs, outfile): - #Take in a dictionary of conversions and calculate conversion rates for all conversions - #convs has the form {readid : {a_a : count, a_t : 
count, etc.}} - - a = 0 #running counter of all *reference* a's (converted and not converted) - a_t = 0 #running counter of all a to t conversions - a_c = 0 - a_g = 0 - a_n = 0 - c = 0 - c_t = 0 - c_g = 0 - c_a = 0 - c_n = 0 - g = 0 - g_t = 0 - g_a = 0 - g_c = 0 - g_n = 0 - t = 0 - t_a = 0 - t_g = 0 - t_c = 0 - t_n = 0 - - for read in convs: - conv_in_read = convs[read] - a += (conv_in_read['a_a'] + conv_in_read['a_t'] + conv_in_read['a_c'] + conv_in_read['a_g'] + conv_in_read['a_n']) - a_t += conv_in_read['a_t'] - a_c += conv_in_read['a_c'] - a_g += conv_in_read['a_g'] - a_n += conv_in_read['a_n'] - - c += (conv_in_read['c_a'] + conv_in_read['c_t'] + conv_in_read['c_c'] + conv_in_read['c_g'] + conv_in_read['c_n']) - c_t += conv_in_read['c_t'] - c_a += conv_in_read['c_a'] - c_g += conv_in_read['c_g'] - c_n += conv_in_read['c_n'] - - g += (conv_in_read['g_a'] + conv_in_read['g_t'] + conv_in_read['g_c'] + conv_in_read['g_g'] + conv_in_read['g_n']) - g_t += conv_in_read['g_t'] - g_a += conv_in_read['g_a'] - g_c += conv_in_read['g_c'] - g_n += conv_in_read['g_n'] - - t += (conv_in_read['t_a'] + conv_in_read['t_t'] + conv_in_read['t_c'] + conv_in_read['t_g'] + conv_in_read['t_n']) - t_g += conv_in_read['t_g'] - t_a += conv_in_read['t_a'] - t_c += conv_in_read['t_c'] - t_n += conv_in_read['t_n'] - - totalnt = a + c + g + t - totalconv = a_t + a_c + a_g + c_t + c_a + c_g + g_t + g_a + g_c + t_g + t_a + t_c - - totalerrorrate = totalconv / totalnt - - with open(outfile, 'w') as outfh: - outfh.write(('\t').join([ - 'Acount', 'A_Tcount', 'A_Ccount', 'A_Gcount', 'A_Ncount', - 'Ccount', 'C_Tcount', 'C_Acount', 'C_Gcount', 'C_Ncount', - 'Gcount', 'G_Tcount', 'G_Ccount', 'G_Acount', 'G_Ncount', - 'Tcount', 'T_Acount', 'T_Ccount', 'T_Gcount', 'T_Ncount', - 'A_Trate', 'A_Crate', 'A_Grate', 'A_Nrate', - 'C_Trate', 'C_Arate', 'C_Grate', 'C_Nrate', - 'G_Trate', 'G_Crate', 'G_Arate', 'G_Nrate', - 'T_Arate', 'T_Crate', 'T_Grate', 'T_Nrate', 'totalnt', 'totalconv', 'totalerrorrate' - ]) 
+ '\n') - - outfh.write(('\t').join([ - str(a), str(a_t), str(a_c), str(a_g), str(a_n), - str(c), str(c_t), str(c_a), str(c_g), str(c_n), - str(g), str(g_t), str(g_c), str(g_a), str(g_n), - str(t), str(t_a), str(t_c), str(t_g), str(t_n), - str(a_t / a), str(a_c / a), str(a_g / a), str(a_n / a), - str(c_t / c), str(c_a / c), str(c_g / c), str(c_n / c), - str(g_t / g), str(g_c / g), str(g_a / g), str(g_n / g), - str(t_a / t), str(t_c / t), str(t_g / t), str(t_n / t), - str(totalnt), str(totalconv), str(totalerrorrate) - ])) - -def split_bam(bam, nproc): - #Split bam using samtools instead of bamtools (may be faster) - #Check for index - if os.path.exists(os.path.abspath(bam) + '.bai'): - pass - else: - indexCMD = 'samtools index ' + os.path.abspath(bam) - index = subprocess.Popen(indexCMD, shell = True) - index.wait() - - #Get chromosome names - idxstatsCMD = 'samtools idxstats ' + os.path.abspath(bam) - idxstats = subprocess.Popen(idxstatsCMD, shell = True, stdout = subprocess.PIPE) - chrms = [] - for line in idxstats.stdout: - line = line.decode('UTF-8').strip().split('\t') - chrm = line[0] - if chrm != '*': - chrms.append(chrm) - - splitbams = [] - for chrm in chrms: - filebasename = chrm + '_SPLIT.bam' - filepath = os.path.join(os.path.dirname(os.path.abspath(bam)), filebasename) - splitbams.append(filepath) - splitCMD = 'samtools view -@ ' + str(nproc) + ' -b ' + os.path.abspath(bam) + ' ' + chrm + ' > ' + filepath - s = subprocess.Popen(splitCMD, shell = True) - s.wait() - - return splitbams - - -def getmismatches(bam, onlyConsiderOverlap, snps, requireMultipleConv, nproc): - #Actually run the mismatch code (calling iteratereads_pairedend) - #use multiprocessing - #If there's only one processor, easier to use iteratereads_pairedend() directly. 
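The rate arithmetic in `summarize_convs()` above reduces to tallying each conversion across reads and dividing by the total count of that reference base (converted plus unconverted). A condensed sketch on toy counts (the read dictionaries below are made up for illustration):

```python
#Condensed sketch of the conversion-rate math in summarize_convs(): pool the
#per-read conversion counters, then divide each conversion count by the total
#number of reference G's observed.
from collections import Counter

convs = {
    'read1': {'g_g': 10, 'g_t': 1, 'g_c': 0, 'g_a': 0, 'g_n': 0},
    'read2': {'g_g': 8, 'g_t': 0, 'g_c': 1, 'g_a': 0, 'g_n': 0},
}

totals = Counter()
for per_read in convs.values():
    totals.update(per_read)

#Reference G count includes unconverted g_g positions
total_g = sum(totals[k] for k in ('g_g', 'g_t', 'g_c', 'g_a', 'g_n'))
g_t_rate = totals['g_t'] / total_g  #1 / 20 = 0.05
g_c_rate = totals['g_c'] / total_g  #1 / 20 = 0.05
```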
- - pool = mp.Pool(processes = int(nproc)) - print('Using {0} processors to identify mismatches in {1}.'.format(nproc, bam)) - splitbams = split_bam(bam, int(nproc)) - argslist = [] - for x in splitbams: - argslist.append((x, bool(onlyConsiderOverlap), snps, bool(requireMultipleConv), 'low')) - - #items returned from iteratereads_pairedend are in a list, one per process - totalreadcounter = 0 #number of reads across all the split bams - results = pool.starmap(iteratereads_pairedend, argslist) #thhis actually returns two things, convs and readcounter - #so i bet this is a nested list where the first item in each list in a convs and the second item is a readcounter - convs_split = [] - for result in results: - convs_split.append(result[0]) - for result in results: - totalreadcounter += result[1] - - print('Queried {0} read pairs in {1}.'.format(totalreadcounter, os.path.basename(bam))) - - #Reorganize convs_split into convs as it is without multiprocessing - convs = {} #{readid : dictionary of all conversions} - for splitconv in convs_split: - convs.update(splitconv) - - #cleanup - for splitbam in splitbams: - os.remove(splitbam) - - return convs - - - - - -if __name__ == '__main__': - snps = findsnps('trash', None, 20, 0.02) #control bams (comma separated), genomefasta - - #convs = iteratereads_singleend(sys.argv[1], None) - convs = iteratereads_pairedend(sys.argv[1], True, snps, False) - #with open('OINC3.mDBF.subsampled.filtered.convs.pkl', 'wb') as outfh: - #pickle.dump(convs, outfh) - #summarize_convs(convs, sys.argv[2]) - overlaps, numpairs = getReadOverlaps(sys.argv[1], sys.argv[2], sys.argv[3]) #bam, geneBed, chrom.sizes - read2gene = processOverlaps(overlaps, numpairs) - with open('read2gene.pkl', 'wb') as outfh: - pickle.dump(read2gene, outfh) - with open('convs.pkl', 'wb') as outfh: - pickle.dump(convs, outfh) - - numreadspergene, convsPerGene = getPerGene(convs, read2gene) - writeConvsPerGene(numreadspergene, convsPerGene, sys.argv[4]) #output - - \ No 
newline at end of file diff --git a/glm.py b/glm.py deleted file mode 100644 index e4ab00f..0000000 --- a/glm.py +++ /dev/null @@ -1,180 +0,0 @@ -import pandas as pd -import sys -from functools import reduce -import statsmodels.api as sm -import statsmodels.formula.api as smf -from scipy.stats.distributions import chi2 -import numpy as np -from statsmodels.stats.multitest import multipletests -from collections import OrderedDict -from statsmodels.tools.sm_exceptions import PerfectSeparationError as PSE - - -def combinesamples(slamdunkouts, samplenames): - #Given a slew of slam dunk outputs, combine them together into one df - #slamdunkouts is a comma-separated list of filepaths - #samplenames is a comma-separated list of names, in the same order as slamdunkouts - #conditions is a comma-separated list of conditionIDs (max 2), in the same order of slamdunkouts - - dfs = [] - samplenames = samplenames.split(',') - #conditions = conditions.split(',') - for idx, sd in enumerate(slamdunkouts.split(',')): - #skip first 2 header lines - df = pd.read_csv(sd, sep = '\t', skiprows = 2, header = 0) - columns = list(df.columns) - columnstokeep = ['Name', 'G_G', 'G_C', 'G_T'] - columnstodrop = [c for c in columns if c not in columnstokeep] - df = df.drop(labels = columnstodrop, axis = 1) - #Combine G_C and G_T - df = df.assign(G_mut = df['G_C'] + df['G_T']) - #totalGs - df = df.assign(totalG = df['G_G'] + df['G_mut']) - - #rename columns in preparation for merging - columns = list(df.columns) - untouchablecolumnnames = ['Name'] - samplename = samplenames[idx] - columns = [c + ';' + samplename if c not in untouchablecolumnnames else c for c in columns] - df.columns = columns - df = df.drop_duplicates(ignore_index = True) #somehow duplicate transcripts? 
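The per-sample column math in `combinesamples()` above is simple enough to check by hand; here it is on a plain dict rather than a pandas DataFrame (toy counts are illustrative only):

```python
#Minimal sketch of the derived columns built in combinesamples(): G->C and
#G->T counts are pooled into G_mut, and totalG = G_G + G_mut, i.e. all
#reference G's whether converted or not.
def add_derived_columns(row):
    row = dict(row)
    row['G_mut'] = row['G_C'] + row['G_T']
    row['totalG'] = row['G_G'] + row['G_mut']
    return row

row = add_derived_columns({'Name': 'geneA', 'G_G': 95, 'G_C': 2, 'G_T': 3})
#row['G_mut'] == 5 and row['totalG'] == 100
```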
- dfs.append(df) - - bigdf = reduce(lambda x, y: pd.merge(x, y, on = ['Name']), dfs) - bigdf = bigdf.drop_duplicates(ignore_index = True) - #Remove any row with NA value - bigdf = bigdf.dropna(axis = 0, how = 'any') - - return bigdf - -def classifysamples(samplenames, conditions): - #samplenames is a comma-separated list of names - #conditions is a comma-separated list of conditionIDs (max 2), in the same order of samplenames - samplenames = samplenames.split(',') - conditions = conditions.split(',') - d = dict(zip(samplenames, conditions)) - - return d - -def doglm(bigdf, sampconds): - genecounter = 0 - ps = OrderedDict() #{genename : p} - mintotalGs = OrderedDict() #{genename : min number of total Gs across all samples} - meantotalGs = OrderedDict() #{genename : mean number of total Gs across all samples} - condArates = OrderedDict() #{genename : condA mean mutation rate} - condBrates = OrderedDict() #{genename : condA mean mutation rate} - ratelog2fc = OrderedDict() #{genename : log2fc in mutation rates (B/A)} - #sampconds is dict of {samplename : condition} - for index, row in bigdf.iterrows(): - genecounter +=1 - if genecounter % 1000 == 0: - print('Gene {0}...'.format(genecounter)) - samples = list(sampconds.keys()) - G_Gs = [] - G_Cs = [] - G_Ts = [] - G_muts = [] - totalGs = [] - conds = [] - for sample in samples: - G_G = row['G_G;{0}'.format(sample)] - G_C = row['G_C;{0}'.format(sample)] - G_T = row['G_T;{0}'.format(sample)] - G_mut = row['G_mut;{0}'.format(sample)] - totalG = row['totalG;{0}'.format(sample)] - cond = sampconds[sample] - G_Gs.append(G_G) - G_Cs.append(G_C) - G_Ts.append(G_T) - G_muts.append(G_mut) - totalGs.append(totalG) - conds.append(cond) - - mintotalGs[row['Name']] = min(totalGs) - meantotalGs[row['Name']] = np.mean(totalGs) - d = {'G_G' : G_Gs, 'G_C' : G_Cs, 'G_T' : G_Ts, 'G_mut' : G_muts, 'totalG' : totalGs, 'cond' : conds} - rowdf = pd.DataFrame.from_dict(d) - totalGs = rowdf['totalG'].tolist() - totalmuts = 
rowdf['G_mut'].tolist() - minG = min(totalGs) - minmut = min(totalmuts) - #Implement totalG count filter - if minG < 100: - p = np.nan - - else: - try: - #GLM - mod_real = smf.glm('G_mut + G_G ~ cond', family = sm.families.Binomial(), data = rowdf).fit() - mod_null = smf.glm('G_mut + G_G ~ 1', family = sm.families.Binomial(), data = rowdf).fit() - - #Likelihood ratio test - logratio = (mod_real.llf - mod_null.llf) * 2 - p = round(chi2.sf(logratio, df = 1), 4) - #If all mutation counts in one condition are 0, this causes a problem called Perfect Separation Error - #Interestingly, this is not triggered if there is only 1 replicate in a condition - except PSE: - p = np.nan - - - ps[row['Name']] = p - - #Calculate mean conversion rates for each condition and the difference between them - conds = sorted(list(set(conds))) #conds are alphabetically sorted. condA is the first one, condB is the second one - for idx, cond in enumerate(conds): - conddf = rowdf[rowdf['cond'] == cond] - rates = [] - for condi, condr in conddf.iterrows(): - try: - rate = condr['G_mut'] / condr['totalG'] - except ZeroDivisionError: - rate = np.nan - - rates.append(rate) - - meanrate = np.mean(rates) - if idx == 0: - condArates[row['Name']] = meanrate - condArate = meanrate - elif idx == 1: - condBrates[row['Name']] = meanrate - condBrate = meanrate - - pc = 1e-6 - log2fc = np.log2((condBrate + pc) / (condArate + pc)) - ratelog2fc[row['Name']] = log2fc - - #Correct pvalues using BH method, but only using pvalues that are not NA - pvalues = list(ps.values()) - pvaluesformultitest = [pval for pval in pvalues if str(pval) != 'nan'] - fdrs = list(multipletests(pvaluesformultitest, method = 'fdr_bh')[1]) - fdrs = [float('{:.2e}'.format(fdr)) for fdr in fdrs] - - #Now we need to incorporate the places where p = NA into the list of FDRs (also as NA) - fdrswithnas = [] - fdrindex = 0 - for pvalue in pvalues: - if str(pvalue) != 'nan': - fdrswithnas.append(fdrs[fdrindex]) - fdrindex +=1 - elif str(pvalue) 
== 'nan': - fdrswithnas.append(np.nan) - - genes = bigdf['Name'].tolist() - outd = {'Gene' : genes, 'minGcount' : list(mintotalGs.values()), 'meanGcount' : list(meantotalGs.values()), '{0}mutrate'.format(conds[0]) : list(condArates.values()), '{0}mutrate'.format(conds[1]) : list(condBrates.values()), 'log2fc' : list(ratelog2fc.values()), 'p' : list(ps.values()), 'FDR' : fdrswithnas} - outdf = pd.DataFrame.from_dict(outd) - - #format output columns - formats = {'minGcount': '{:d}', 'meanGcount': '{:.2f}', '{0}mutrate'.format(conds[0]): '{:.2e}', '{0}mutrate'.format(conds[1]): '{:.2e}', 'log2fc': '{:.2f}', 'p': '{:.3f}', 'FDR': '{:.3f}'} - for col, f in formats.items(): - outdf[col] = outdf[col].map(lambda x : f.format(x)) - - outdf.to_csv('glm.txt', sep = '\t', header = True, index = False, na_rep = 'NA') - - -#Usage: glm.py -#e.g. glm.py output1.txt,output2.txt,output3.txt,output4.txt S1R1,S1R2,S2R1,S2R2, S1,S1,S2,S2 - -bigdf = combinesamples(sys.argv[1], sys.argv[2]) -sampconds = classifysamples(sys.argv[2], sys.argv[3]) -doglm(bigdf, sampconds) diff --git a/pigpen.py b/pigpen.py deleted file mode 100644 index c222157..0000000 --- a/pigpen.py +++ /dev/null @@ -1,86 +0,0 @@ -#Pipeline for Identification of Guanosine Positions Erroneously Notated -#PIGPEN - -import argparse -import subprocess -import os -from snps import getSNPs, recordSNPs -from filterbam import intersectreads, filterbam, intersectreads_multiprocess -from getmismatches import iteratereads_pairedend, getmismatches -from assignreads import getReadOverlaps, processOverlaps -from conversionsPerGene import getPerGene, writeConvsPerGene - -if __name__ == '__main__': - parser = argparse.ArgumentParser(description = 'pigpen for quantifying OINC-seq data') - parser.add_argument('--bam', type = str, help = 'Aligned reads (ideally STAR uniquely aligned reads) to quantify', - required = True) - parser.add_argument('--controlBams', type = str, help = 'Comma separated list of alignments from control samples 
(i.e. those where no *induced* conversions are expected. Required if SNPs are to be considered.') - parser.add_argument('--genomeFasta', type = str, help = 'Genome sequence in fasta format. Required if SNPs are to be considered.') - parser.add_argument('--geneBed', type = str, help = 'Bed file of genomic regions to quantify. Fourth field must be gene ID.') - parser.add_argument('--chromsizes', type = str, help = 'Tab-delimited file of chromosomes in the order the appear in the bed/bam and their sizes. Can be made with cut -f 1,2 genome.fa.fai') - parser.add_argument('--output', type = str, help = 'Output file of conversion rates for each gene.') - parser.add_argument('--nproc', type = int, help = 'Number of processors to use. Default is 1.', default = 1) - parser.add_argument('--useSNPs', action = 'store_true', help = 'Consider SNPs?') - parser.add_argument('--SNPcoverage', type = int, help = 'Minimum coverage to call SNPs. Default = 20', default = 20) - parser.add_argument('--SNPfreq', type = float, help = 'Minimum variant frequency to call SNPs. 
Default = 0.02', default = 0.02) - parser.add_argument('--onlyConsiderOverlap', action = 'store_true', help = 'Only consider conversions seen in both reads of a read pair?') - parser.add_argument('--requireMultipleConv', action = 'store_true', help = 'Only consider conversions seen in reads with multiple G->C + G->T conversions?') - args = parser.parse_args() - - #Make index for bam if there isn't one already - bamindex = args.bam + '.bai' - if not os.path.exists(bamindex): - indexCMD = 'samtools index ' + args.bam - index = subprocess.Popen(indexCMD, shell = True) - index.wait() - - #Make vcf file for snps - if args.useSNPs: - controlbams = args.controlBams.split(',') - - #Make index for each control bam if there isn't one already - for bam in controlbams: - bamindex = bam + '.bai' - if not os.path.exists(bamindex): - indexCMD = 'samtools index ' + bam - index = subprocess.Popen(indexCMD, shell = True) - index.wait() - - vcfFileNames = getSNPs(controlbams, args.genomeFasta, args.SNPcoverage, args.SNPfreq) - snps = recordSNPs('merged.vcf') - - elif not args.useSNPs: - snps = None - - #Filter bam for reads contained within entries in geneBed - #This will reduce the amount of time it takes to find conversions - print('Filtering bam for reads contained within regions of interest...') - if args.nproc == 1: - intersectreads(args.bam, args.geneBed, args.chromsizes) - filteredbam = filterbam(args.bam, args.nproc) - elif args.nproc > 1: - intersectreads_multiprocess(args.bam, args.geneBed, args.chromsizes, args.nproc) - filteredbam = filterbam(args.bam, args.nproc) - - #Identify conversions - if args.nproc == 1: - convs, readcounter = iteratereads_pairedend(filteredbam, args.onlyConsiderOverlap, snps, args.requireMultipleConv, 'high') - elif args.nproc > 1: - convs = getmismatches(filteredbam, args.onlyConsiderOverlap, snps, args.requireMultipleConv, args.nproc) - - - #Assign reads to genes - print('Assigning reads to genes...') - overlaps, numpairs = 
getReadOverlaps(filteredbam, args.geneBed, args.chromsizes) - read2gene = processOverlaps(overlaps, numpairs) - - #Calculate number of conversions per gene - numreadspergene, convsPerGene = getPerGene(convs, read2gene) - writeConvsPerGene(numreadspergene, convsPerGene, args.output) - - - - - - - diff --git a/pigpen_env.yaml b/pigpen_env.yaml new file mode 100644 index 0000000..7d0478d --- /dev/null +++ b/pigpen_env.yaml @@ -0,0 +1,23 @@ +name: pigpen_env +channels: + - conda-forge + - bioconda +dependencies: + - python >=3.5,<=3.10 + - samtools >=1.15 + - varscan >=2.4.4 + - bcftools >=1.15 + - pysam >=0.19 + - numpy >=1.21 + - pandas >=1.3.5 + - bamtools >=2.5.2 + - salmon >=1.9.0 + - gffutils >=0.11.0 + - pybedtools >=0.9.0 + - umi_tools >=1.1.0 + - star >=2.7.10 + - statsmodels >=0.13.2 + - rpy2 >=3.4.5 + - r-base >=4.1 + - r-lme4 >=1.1 + - postmaster >=0.1.0 \ No newline at end of file diff --git a/setup.py b/setup.py new file mode 100644 index 0000000..5875fb9 --- /dev/null +++ b/setup.py @@ -0,0 +1,12 @@ +#from distutils.core import setup +#add tests? 
+from setuptools import setup, find_packages +setup(name = 'pigpen', +description = 'Pipeline for the Identification of Guanosine Positions Erroneously Notated', +author = 'Matthew Taliaferro', +author_email = 'taliaferrojm@gmail.com', +url = 'https://github.com/TaliaferroLab/OINC-seq', +version = '0.0.7', +packages = find_packages(where = './src', exclude = ['workflow', 'testdata']), +package_dir = {'':'src'}, +entry_points = {'console_scripts': ['pigpen = pigpen.runpigpen:main', 'bacon = pigpen.bacon_glm:main', 'alignAndQuant = pigpen.alignAndQuant:main']}) diff --git a/src/pigpen/ExtractUMI.py b/src/pigpen/ExtractUMI.py new file mode 100644 index 0000000..d21248f --- /dev/null +++ b/src/pigpen/ExtractUMI.py @@ -0,0 +1,62 @@ +import os +import subprocess +import sys +import shutil +import argparse +''' +Given a pair of read files, extract UMIs matching a given pattern +This step is required for any downstream UMI deduplication +Requires UMI_tools +Usage: +python -u ~/Projects/OINC_seq/test/py2test/ExtractUMI.py --forwardreads ./ATP5MC1_Rep1_pDBF.10M.R1.fq.gz,./ATP5MC1_Rep1_mDBF.10M.R1.fq.gz --reversereads ./ATP5MC1_Rep1_pDBF.10M.R2.fq.gz,./ATP5MC1_Rep1_mDBF.10M.R2.fq.gz --samplename pDBF10M,mDBF10M --lib_type LEXO + +''' +def runExtract(r1, r2, samplename, lib_type): + if not os.path.exists('UMI_fastq'): + os.mkdir('UMI_fastq') + + cwd = os.getcwd() + outdir = os.path.join(cwd, 'UMI_fastq') + + r1 = r1.split(",") + r2 = r2.split(",") + samplename = samplename.split(",") + + for idx, sample in enumerate(samplename): + + reads1 = r1[idx] + reads2 = r2[idx] + output1 = outdir + '/' + sample + '.R1.fq.gz' + output2 = outdir + '/' + sample + '.R2.fq.gz' + + if lib_type == "LEXO": + command = ["umi_tools", "extract", "-I", reads1, '--bc-pattern=NNNNNN', '--read2-in={0}'.format(reads2), '--stdout={0}'.format(output1),'--read2-out={0}'.format(output2)] + elif lib_type == "SA": + command = ["umi_tools", "extract", "-I", reads2, '--bc-pattern=NNNNNNNNNNNN', 
'--read2-in={0}'.format(reads1), '--stdout={0}'.format(output2),'--read2-out={0}'.format(output1)]
+        else:
+            print('--lib_type must be either "LEXO" or "SA"')
+            sys.exit()
+
+        print('Extracting UMIs for {0}...'.format(sample))
+        subprocess.run(command)
+        print('Finished Extracting UMIs for {0}!'.format(sample))
+
+if __name__ == '__main__':
+    parser = argparse.ArgumentParser(description = 'Extract UMIs using umi-tools in preparation for analysis with AlignUMIquant.')
+    parser.add_argument('--forwardreads', type = str, help = 'Forward reads. Gzipped fastq.', required = True)
+    parser.add_argument('--reversereads', type = str, help = 'Reverse reads. Gzipped fastq.', required = True)
+    parser.add_argument('--samplename', type = str, help = 'Sample name. Will be appended to output files.', required = True)
+    parser.add_argument('--lib_type', type = str, help = 'Library type. Either "LEXO" or "SA"', required = True)
+    args = parser.parse_args()
+
+    r1 = os.path.abspath(args.forwardreads)
+    r2 = os.path.abspath(args.reversereads)
+    samplename = args.samplename
+    lib_type = args.lib_type
+
+    if args.lib_type not in ["LEXO", "SA"]:
+        print('--lib_type must be either "LEXO" or "SA"')
+        sys.exit()
+
+    runExtract(r1, r2, samplename, lib_type)
+
diff --git a/src/pigpen/__init__.py b/src/pigpen/__init__.py
new file mode 100644
index 0000000..e69de29
diff --git a/src/pigpen/alignAndQuant.py b/src/pigpen/alignAndQuant.py
new file mode 100644
index 0000000..b04a3c3
--- /dev/null
+++ b/src/pigpen/alignAndQuant.py
@@ -0,0 +1,255 @@
+#!/usr/bin/env python
+
+import os
+import subprocess
+import sys
+import shutil
+import argparse
+import pysam
+
+#Given a pair of read files, align reads using STAR and quantify/align reads using salmon.
+#This will make a STAR-produced bam (for pigpen mutation calling) and a salmon-produced bam (for read assignment).
+#It will then run postmaster to append transcript assignments to the salmon-produced bam.
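The comments above describe a pipeline of external tools (STAR, salmon, postmaster) run one after another via `subprocess`. As a minimal sketch of that driver pattern — the `run_step` helper is illustrative and not part of this codebase — each step can build an argument list and run it with `subprocess.run(check=True)`, so a failed step raises instead of letting later steps run against missing output:

```python
import subprocess
import sys

def run_step(command, stepname):
    #Hypothetical helper: run one external pipeline step and raise
    #CalledProcessError if it exits nonzero, so downstream steps never
    #operate on missing or partial output.
    print('Running {0}...'.format(stepname))
    result = subprocess.run(command, check=True)
    print('Finished {0}!'.format(stepname))
    return result.returncode

#Stand-in command; a real call would pass e.g.
#['STAR', '--runMode', 'alignReads', '--genomeDir', STARindex, ...]
rc = run_step([sys.executable, '-c', 'pass'], 'demo step')
```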
+ +#This is going to take in gzipped fastqs, a directory containing the STAR index for this genome, and a directory containing the salmon index for this genome. + +#Reads are aligned to the genome using STAR. This bam file will be used for mutation calling. Uniquely aligning reads from this alignment are then +#written to temporary fastq files (.unique.r1..fq.gz), which are then used for salmon and postmaster. + +#When runSTAR(), bamtofastq(), runSalmon(), and runPostmaster() are run in succession, the output is a directory called . +#In this directory, the STAR output is Aligned.sortedByCoord.out.bam in STAR/, +#the salmon output is .quant.sf and .salmon.bam in salmon/, +#and the postmaster output is .postmaster.bam in postmaster/ + +#maxmap parameter can be used for filtering multimapping reads + +#Requires STAR, salmon(>= 1.9.0), and postmaster be in user's PATH. + +parser = argparse.ArgumentParser(description = 'Align and quantify reads using STAR, salmon, and postmaster in preparation for analysis with PIGPEN.') +parser.add_argument('--forwardreads', type = str, help = 'Forward reads. Gzipped fastq.') +parser.add_argument('--reversereads', type = str, help = 'Reverse reads. Gzipped fastq. Do not supply if using single end reads.') +parser.add_argument('--nthreads', type = str, help = 'Number of threads to use for alignment and quantification.') +parser.add_argument('--STARindex', type = str, help = 'STAR index directory.') +parser.add_argument('--salmonindex', type = str, help = 'Salmon index directory.') +parser.add_argument('--samplename', type = str, help = 'Sample name. 
Will be appended to output files.') +parser.add_argument('--maxmap', type = int, help = 'Maximum number of allowable alignments for a read.') +args = parser.parse_args() + +def runSTAR(reads1, reads2, nthreads, STARindex, samplename): + if not os.path.exists('STAR'): + os.mkdir('STAR') + + cwd = os.getcwd() + outdir = os.path.join(cwd, 'STAR') + + #Clean output directory if it already exists + if os.path.exists(outdir) and os.path.isdir(outdir): + shutil.rmtree(outdir) + + os.mkdir(outdir) + prefix = outdir + '/' + samplename + + + if reads2: + command = ['STAR', '--runMode', 'alignReads', '--runThreadN', nthreads, '--genomeLoad', 'NoSharedMemory', '--genomeDir', STARindex, '--readFilesIn', reads1, reads2, '--readFilesCommand', + 'zcat', '--outFileNamePrefix', prefix, '--outSAMtype', 'BAM', 'SortedByCoordinate', '--outSAMstrandField', 'intronMotif', '--outSAMmultNmax', '1', '--outSAMattributes', 'MD', 'NH'] + + elif not reads2: + command = ['STAR', '--runMode', 'alignReads', '--runThreadN', nthreads, '--genomeLoad', 'NoSharedMemory', '--genomeDir', STARindex, '--readFilesIn', reads1, '--readFilesCommand', + 'zcat', '--outFileNamePrefix', prefix, '--outSAMtype', 'BAM', 'SortedByCoordinate', '--outSAMstrandField', 'intronMotif', '--outSAMmultNmax', '1', '--outSAMattributes', 'MD', 'NH'] + + print('Running STAR for {0}...'.format(samplename)) + + subprocess.run(command) + + #make index + bam = os.path.join(outdir, samplename + 'Aligned.sortedByCoord.out.bam') + bamindex = bam + '.bai' + if not os.path.exists(bamindex): + indexCMD = 'samtools index ' + bam + index = subprocess.Popen(indexCMD, shell=True) + index.wait() + + print('Finished STAR for {0}!'.format(samplename)) + +def filterbam(samplename, maxmap): + #Take a bam and filter it, only keeping reads that map to <= maxmap locations using NH:i tag + #For some reason whether STAR uses --outFilterMultiMapNmax is unpredictably variable, so we will do it this way. 
+ cwd = os.getcwd() + outdir = os.path.join(cwd, 'STAR') + maxmap = int(maxmap) + inbam = os.path.join(outdir, samplename + 'Aligned.sortedByCoord.out.bam') + outbam = os.path.join(outdir, samplename + 'Aligned.sortedByCoord.multifiltered.out.bam') + + print('Removing reads with > {0} alignments...'.format(maxmap)) + + with pysam.AlignmentFile(inbam, 'rb') as infh, pysam.AlignmentFile(outbam, 'wb', template = infh) as outfh: + readcount = 0 + filteredreadcount = 0 + for read in infh.fetch(until_eof = True): + readcount +=1 + nh = read.get_tag('NH') + if nh <= maxmap: + filteredreadcount +=1 + outfh.write(read) + + #Remove unfiltered bam and its index + os.remove(inbam) + os.remove(inbam + '.bai') + #Rename filtered bam so that it has the same name as the original + #This helps later when pipgen is looking for bams with certain expected names + os.rename(outbam, inbam) + #index filtered bam + bamindex = inbam + '.bai' + indexCMD = 'samtools index ' + inbam + index = subprocess.Popen(indexCMD, shell=True) + index.wait() + + filteredpct = round((filteredreadcount / readcount) * 100, 3) + + print('Looked through {0} reads. 
{1} ({2}%) had {3} or fewer alignments.'.format(readcount, filteredreadcount, filteredpct, maxmap)) + + +def bamtofastq(samplename, nthreads, reads2): + #Given a bam file of aligned reads (produced from runSTAR), rederive these reads as fastq in preparation for submission to salmon + if not os.path.exists('STAR'): + os.mkdir('STAR') + + cwd = os.getcwd() + outdir = os.path.join(cwd, 'STAR') + inbam = os.path.join(outdir, samplename + 'Aligned.sortedByCoord.out.bam') + sortedbam = os.path.join(outdir, 'temp.namesort.bam') + + #First sort bam file by readname + print('Sorting bam file by read name...') + command = ['samtools', 'collate', '--threads', nthreads, '-u', '-o', sortedbam, inbam] + subprocess.call(command) + print('Done!') + + #Now derive fastq + r1file = samplename + '.aligned.r1.fq.gz' + r2file = samplename + '.aligned.r2.fq.gz' + print('Writing fastq file of aligned reads for {0}...'.format(samplename)) + if reads2: + command = ['samtools', 'fastq', '--threads', nthreads, '-1', r1file, '-2', r2file, '-0', '/dev/null', '-s', '/dev/null', '-n', sortedbam] + elif not reads2: + command = ['samtools', 'fastq', '--threads', nthreads, '-0', r1file, '-n', sortedbam] + subprocess.call(command) + print('Done writing fastq files for {0}!'.format(samplename)) + + os.remove(sortedbam) + + +def runSalmon(reads1, reads2, nthreads, salmonindex, samplename): + #Take in those aligning reads and quantify transcript abundance with them using salmon. 
+ + if not os.path.exists('salmon'): + os.mkdir('salmon') + + idx = os.path.abspath(salmonindex) + r1 = os.path.abspath(reads1) + if reads2: + r2 = os.path.abspath(reads2) + + os.chdir('salmon') + + if reads2: + command = ['salmon', 'quant', '--libType', 'A', '-p', nthreads, '--seqBias', '--gcBias', + '--validateMappings', '-1', r1, '-2', r2, '-o', samplename, '--index', idx, '--writeMappings={0}.salmon.bam'.format(samplename), '--writeQualities'] + elif not reads2: + command = ['salmon', 'quant', '--libType', 'A', '-p', nthreads, '--seqBias', + '--validateMappings', '-r', r1, '-o', samplename, '--index', idx, '--writeMappings={0}.salmon.bam'.format(samplename), '--writeQualities'] + + print('Running salmon for {0}...'.format(samplename)) + + subprocess.run(command) + + #Move output + outputdir = os.path.join(os.getcwd(), samplename) + quantfile = os.path.join(outputdir, 'quant.sf') + movedquantfile = os.path.join(os.getcwd(), '{0}.quant.sf'.format(samplename)) + os.rename(quantfile, movedquantfile) + + print('Finished salmon for {0}!'.format(samplename)) + +def runPostmaster(samplename, nthreads): + if not os.path.exists('postmaster'): + os.mkdir('postmaster') + + salmonquant = os.path.join(os.getcwd(), 'salmon', '{0}.quant.sf'.format(samplename)) + salmonbam = os.path.join(os.getcwd(), 'salmon', '{0}.salmon.bam'.format(samplename)) + + os.chdir('postmaster') + outputfile = os.path.join(os.getcwd(), '{0}.postmaster.bam'.format(samplename)) + + print('Running postmaster for {0}...'.format(samplename)) + command = ['postmaster', '--num-threads', nthreads, '--quant', salmonquant, '--alignments', salmonbam, '--output', outputfile] + subprocess.call(command) + + #Sort and index bam + with open(outputfile + '.sort', 'w') as sortedfh: + command = ['samtools', 'sort', '-@', nthreads, outputfile] + subprocess.run(command, stdout = sortedfh) + + os.rename(outputfile + '.sort', outputfile) + + command = ['samtools', 'index', outputfile] + subprocess.run(command) + + #We 
don't need the salmon alignment file anymore, and it's pretty big + os.remove(salmonbam) + + print('Finished postmaster for {0}!'.format(samplename)) + +def addMD(samplename, reffasta, nthreads): + inputbam = os.path.join(os.getcwd(), 'postmaster', '{0}.postmaster.bam'.format(samplename)) + command = ['samtools', 'calmd', '-b', '--threads', nthreads, inputbam, reffasta] + + print('Adding MD tags to {0}.postmaster.md.bam...'.format(samplename)) + with open(samplename + '.postmaster.md.bam', 'w') as outfile: + subprocess.run(command, stdout = outfile) + print('Finished adding MD tags to {0}.postmaster.md.bam!'.format(samplename)) + +def main(): + r1 = os.path.abspath(args.forwardreads) + if args.reversereads: + r2 = os.path.abspath(args.reversereads) + elif not args.reversereads: + r2 = None + STARindex = os.path.abspath(args.STARindex) + salmonindex = os.path.abspath(args.salmonindex) + samplename = args.samplename + nthreads = args.nthreads + maxmap = args.maxmap + + wd = os.path.abspath(os.getcwd()) + sampledir = os.path.join(wd, samplename) + if os.path.exists(sampledir) and os.path.isdir(sampledir): + shutil.rmtree(sampledir) + os.mkdir(sampledir) + os.chdir(sampledir) + + runSTAR(r1, r2, nthreads, STARindex, samplename) + if maxmap: + filterbam(samplename, maxmap) + + #aligned read files + alignedr1 = samplename + '.aligned.r1.fq.gz' + if args.reversereads: + alignedr2 = samplename + '.aligned.r2.fq.gz' + elif not args.reversereads: + alignedr2 = None + bamtofastq(samplename, nthreads, r2) + runSalmon(alignedr1, alignedr2, nthreads, salmonindex, samplename) + #Remove aligned fastqs + os.chdir(sampledir) + os.remove(alignedr1) + if args.reversereads: + os.remove(alignedr2) + os.chdir(sampledir) + runPostmaster(samplename, nthreads) + +if __name__ == '__main__': + main() + + diff --git a/src/pigpen/alignAndQuant2.py b/src/pigpen/alignAndQuant2.py new file mode 100644 index 0000000..a2c2337 --- /dev/null +++ b/src/pigpen/alignAndQuant2.py @@ -0,0 +1,199 @@ +import 
os +import subprocess +import sys +import shutil +import argparse + +#Given a pair of read files, align reads using STAR and quantify/align reads using salmon. +#This will make a STAR-produced bam (for pigpen mutation calling) and a salmon-produced bam (for read assignment). +#It will then run postmaster to append transcript assignments to the salmon-produced bam. + +#This is going to take in gzipped fastqs, a directory containing the STAR index for this genome, and a directory containing the salmon index for this genome. + +#Reads are aligned to the genome using STAR. This bam file will be used for mutation calling. +#In this alignment, we allow multiple mapping reads, but only report the best alignment. + +#Reads are then quantified using salmon, where a separate transcriptome-oriented bam is written. +#Postmaster then takes this bam and adds posterior probabilities for transcript assignments. + +#alignAndQuant.py only gives uniquely aligned reads to salmon. alignAndQuant2.py gives all reads to salmon. + +#When runSTAR(), bamtofastq(), runSalmon(), and runPostmaster() are run in succession, the output is a directory called . +#In this directory, the STAR output is Aligned.sortedByCoord.out.bam in STAR/, +#the salmon output is .quant.sf and .salmon.bam in salmon/, +#and the postmaster output is .postmaster.bam in postmaster/ + +#Requires STAR, salmon(>= 1.9.0), and postmaster be in user's PATH. 
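The header above notes that STAR, salmon, and postmaster must be on the user's PATH, but none of these scripts verify that before launching. A small preflight check along these lines (the `missing_tools` helper is a sketch, not part of the source) would fail fast with a clear message instead of crashing mid-pipeline:

```python
import shutil
import sys

#Tool names taken from the "Requires ... in user's PATH" comment above;
#samtools is also invoked throughout.
required_tools = ['STAR', 'salmon', 'postmaster', 'samtools']

def missing_tools(tools):
    #Return the subset of tools that cannot be found on the user's PATH.
    return [t for t in tools if shutil.which(t) is None]

missing = missing_tools(required_tools)
if missing:
    print('Not found on PATH: {0}'.format(', '.join(missing)))
    #a real run would stop here, e.g. sys.exit(1)
```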
+ +def runSTAR(reads1, reads2, nthreads, STARindex, samplename): + if not os.path.exists('STAR'): + os.mkdir('STAR') + + cwd = os.getcwd() + outdir = os.path.join(cwd, 'STAR') + + #Clean output directory if it already exists + if os.path.exists(outdir) and os.path.isdir(outdir): + shutil.rmtree(outdir) + + os.mkdir(outdir) + prefix = outdir + '/' + samplename + + if reads2: + command = ['STAR', '--runMode', 'alignReads', '--runThreadN', nthreads, '--genomeLoad', 'NoSharedMemory', '--genomeDir', STARindex, '--readFilesIn', reads1, reads2, '--readFilesCommand', + 'zcat', '--outFileNamePrefix', prefix, '--outSAMtype', 'BAM', 'SortedByCoordinate', '--outSAMstrandField', 'intronMotif', '--outSAMmultNmax', '1', '--outSAMattributes', 'MD', 'NH'] + + elif not reads2: + command = ['STAR', '--runMode', 'alignReads', '--runThreadN', nthreads, '--genomeLoad', 'NoSharedMemory', '--genomeDir', STARindex, '--readFilesIn', reads1, '--readFilesCommand', + 'zcat', '--outFileNamePrefix', prefix, '--outSAMtype', 'BAM', 'SortedByCoordinate', '--outSAMstrandField', 'intronMotif', '--outSAMmultNmax', '1', '--outSAMattributes', 'MD', 'NH'] + + + print('Running STAR for {0}...'.format(samplename)) + + subprocess.run(command) + + #make index + bam = os.path.join(outdir, samplename + 'Aligned.sortedByCoord.out.bam') + bamindex = bam + '.bai' + if not os.path.exists(bamindex): + indexCMD = 'samtools index ' + bam + index = subprocess.Popen(indexCMD, shell=True) + index.wait() + + print('Finished STAR for {0}!'.format(samplename)) + + +def bamtofastq(samplename, nthreads, reads2): + #Given a bam file of uniquely aligned reads (produced from runSTAR), rederive these reads as fastq in preparation for submission to salmon + #This function isn't needed anymore as we will align all reads. 
+ if not os.path.exists('STAR'): + os.mkdir('STAR') + + cwd = os.getcwd() + outdir = os.path.join(cwd, 'STAR') + inbam = os.path.join(outdir, samplename + 'Aligned.sortedByCoord.out.bam') + sortedbam = os.path.join(outdir, 'temp.namesort.bam') + + #First sort bam file by readname + print('Sorting bam file by read name...') + command = ['samtools', 'collate', '--threads', nthreads, '-u', '-o', sortedbam, inbam] + subprocess.call(command) + print('Done!') + + #Now derive fastq + r1file = samplename + '.unique.r1.fq.gz' + r2file = samplename + '.unique.r2.fq.gz' + print('Writing fastq file of uniquely aligned reads for {0}...'.format(samplename)) + if reads2: + command = ['samtools', 'fastq', '--threads', nthreads, '-1', r1file, '-2', r2file, '-0', '/dev/null', '-s', '/dev/null', '-n', sortedbam] + elif not reads2: + command = ['samtools', 'fastq', '--threads', nthreads, '-0', r1file, '-n', sortedbam] + subprocess.call(command) + print('Done writing fastq files for {0}!'.format(samplename)) + + os.remove(sortedbam) + + +def runSalmon(reads1, reads2, nthreads, salmonindex, samplename): + + if not os.path.exists('salmon'): + os.mkdir('salmon') + + idx = os.path.abspath(salmonindex) + r1 = os.path.abspath(reads1) + if reads2: + r2 = os.path.abspath(reads2) + + os.chdir('salmon') + + if reads2: + command = ['salmon', 'quant', '--libType', 'A', '-p', nthreads, '--seqBias', '--gcBias', + '--validateMappings', '-1', r1, '-2', r2, '-o', samplename, '--index', idx, '--writeMappings={0}.salmon.bam'.format(samplename), '--writeQualities'] + elif not reads2: + command = ['salmon', 'quant', '--libType', 'A', '-p', nthreads, '--seqBias', + '--validateMappings', '-r', r1, '-o', samplename, '--index', idx, '--writeMappings={0}.salmon.bam'.format(samplename), '--writeQualities'] + + print('Running salmon for {0}...'.format(samplename)) + + subprocess.run(command) + + #Move output + outputdir = os.path.join(os.getcwd(), samplename) + quantfile = os.path.join(outputdir, 'quant.sf') + 
movedquantfile = os.path.join(os.getcwd(), '{0}.quant.sf'.format(samplename)) + os.rename(quantfile, movedquantfile) + + + print('Finished salmon for {0}!'.format(samplename)) + +def runPostmaster(samplename, nthreads): + if not os.path.exists('postmaster'): + os.mkdir('postmaster') + + salmonquant = os.path.join(os.getcwd(), 'salmon', '{0}.quant.sf'.format(samplename)) + salmonbam = os.path.join(os.getcwd(), 'salmon', '{0}.salmon.bam'.format(samplename)) + + os.chdir('postmaster') + outputfile = os.path.join(os.getcwd(), '{0}.postmaster.bam'.format(samplename)) + + print('Running postmaster for {0}...'.format(samplename)) + command = ['postmaster', '--num-threads', nthreads, '--quant', salmonquant, '--alignments', salmonbam, '--output', outputfile] + subprocess.call(command) + + #Sort and index bam + with open(outputfile + '.sort', 'w') as sortedfh: + command = ['samtools', 'sort', '-@', nthreads, outputfile] + subprocess.run(command, stdout = sortedfh) + + os.rename(outputfile + '.sort', outputfile) + + command = ['samtools', 'index', outputfile] + subprocess.run(command) + + #We don't need the salmon alignment file anymore, and it's pretty big + os.remove(salmonbam) + + print('Finished postmaster for {0}!'.format(samplename)) + +def addMD(samplename, reffasta, nthreads): + inputbam = os.path.join(os.getcwd(), 'postmaster', '{0}.postmaster.bam'.format(samplename)) + command = ['samtools', 'calmd', '-b', '--threads', nthreads, inputbam, reffasta] + + print('Adding MD tags to {0}.postmaster.md.bam...'.format(samplename)) + with open(samplename + '.postmaster.md.bam', 'w') as outfile: + subprocess.run(command, stdout = outfile) + print('Finished adding MD tags to {0}.postmaster.md.bam!'.format(samplename)) + +if __name__ == '__main__': + parser = argparse.ArgumentParser(description = 'Align and quantify reads using STAR, salmon, and postmaster in preparation for analysis with PIGPEN.') + parser.add_argument('--forwardreads', type = str, help = 'Forward reads. 
Gzipped fastq.') + parser.add_argument('--reversereads', type = str, help = 'Reverse reads. Gzipped fastq. Do not supply if using single end reads.') + parser.add_argument('--nthreads', type = str, help = 'Number of threads to use for alignment and quantification.') + parser.add_argument('--STARindex', type = str, help = 'STAR index directory.') + parser.add_argument('--salmonindex', type = str, help = 'Salmon index directory.') + parser.add_argument('--samplename', type = str, help = 'Sample name. Will be appended to output files.') + args = parser.parse_args() + + r1 = os.path.abspath(args.forwardreads) + if args.reversereads: + r2 = os.path.abspath(args.reversereads) + elif not args.reversereads: + r2 = None + STARindex = os.path.abspath(args.STARindex) + salmonindex = os.path.abspath(args.salmonindex) + samplename = args.samplename + nthreads = args.nthreads + + wd = os.path.abspath(os.getcwd()) + sampledir = os.path.join(wd, samplename) + if os.path.exists(sampledir) and os.path.isdir(sampledir): + shutil.rmtree(sampledir) + os.mkdir(sampledir) + os.chdir(sampledir) + + runSTAR(r1, r2, nthreads, STARindex, samplename) + runSalmon(r1, r2, nthreads, salmonindex, samplename) + os.chdir(sampledir) + runPostmaster(samplename, nthreads) + + diff --git a/src/pigpen/alignUMIquant.py b/src/pigpen/alignUMIquant.py new file mode 100644 index 0000000..cc0108c --- /dev/null +++ b/src/pigpen/alignUMIquant.py @@ -0,0 +1,288 @@ +import os +import subprocess +import sys +import shutil +import argparse +import pysam + +''' +Given a pair of read files, align reads using STAR, deduplicate reads by UMI, and quantify reads using salmon. +This will make a STAR-produced bam (for pigpen mutation calling) +This bam will be deduplicated with UMI-tools then passed to salmon(for read assignment). +It will then run postmaster to append transcript assignments to the salmon-produced bam. 
+ +This is going to take in gzipped fastqs with UMIs extracted, +a directory containing the STAR index for this genome, and a directory containing the salmon index for this genome. + +This means that, in addition to any adapter trimming, ***reads must have been first processed with umi_tools extract***. +For quantseq libraries, this corresponds to the first 6 nt of read 1. + +Reads are aligned to the genome using STAR. This bam file will be used for mutation calling. +In this alignment, we allow multiple mapping reads, but only report the best alignment. +This bam will then be deduplicated based on UMI and alignment position. + +Reads are then quantified using salmon, where a separate transcriptome-oriented bam is written. +Postmaster then takes this bam and adds posterior probabilities for transcript assignments. + +When runSTAR(), runSalmon(), and runPostmaster() are run in succession, the output is a directory called . +In this directory, the STAR output is .Aligned.sortedByCoord.out.bam in STAR/, +the salmon output is .quant.sf and .salmon.bam in salmon/, +and the postmaster output is .postmaster.bam in postmaster/ + +Requires STAR, salmon(>= 1.9.0), and postmaster be in user's PATH. 
+''' + +def runSTAR(reads1, reads2, nthreads, STARindex, samplename): + if not os.path.exists('STAR'): + os.mkdir('STAR') + + cwd = os.getcwd() + outdir = os.path.join(cwd, 'STAR') + + #Clean output directory if it already exists + if os.path.exists(outdir) and os.path.isdir(outdir): + shutil.rmtree(outdir) + + os.mkdir(outdir) + prefix = outdir + '/' + samplename + + command = ['STAR', '--runMode', 'alignReads', '--runThreadN', nthreads, '--genomeLoad', 'NoSharedMemory', '--genomeDir', STARindex, '--readFilesIn', reads1, reads2, '--readFilesCommand', + 'zcat', '--outFileNamePrefix', prefix, '--outSAMtype', 'BAM', 'SortedByCoordinate', '--outSAMstrandField', 'intronMotif', '--outSAMmultNmax', '1', '--outSAMattributes', 'MD', 'NH'] + + print('Running STAR for {0}...'.format(samplename)) + + subprocess.run(command) + + #make index + bam = os.path.join(outdir, samplename + 'Aligned.sortedByCoord.out.bam') + bamindex = bam + '.bai' + if not os.path.exists(bamindex): + indexCMD = 'samtools index ' + bam + index = subprocess.Popen(indexCMD, shell=True) + index.wait() + + print('Finished STAR for {0}!'.format(samplename)) + + +def filterbam(samplename, maxmap): + #Take a bam and filter it, only keeping reads that map to <= maxmap locations using NH:i tag + #For some reason whether STAR uses --outFilterMultiMapNmax is unpredictably variable, so we will do it this way. 
+ cwd = os.getcwd() + outdir = os.path.join(cwd, 'STAR') + maxmap = int(maxmap) + inbam = os.path.join(outdir, samplename + 'Aligned.sortedByCoord.out.bam') + outbam = os.path.join(outdir, samplename + + 'Aligned.sortedByCoord.multifiltered.out.bam') + + print('Removing reads with > {0} alignments...'.format(maxmap)) + + with pysam.AlignmentFile(inbam, 'rb') as infh, pysam.AlignmentFile(outbam, 'wb', template=infh) as outfh: + readcount = 0 + filteredreadcount = 0 + for read in infh.fetch(until_eof=True): + readcount += 1 + nh = read.get_tag('NH') + if nh <= maxmap: + filteredreadcount += 1 + outfh.write(read) + + #Remove unfiltered bam and its index + os.remove(inbam) + os.remove(inbam + '.bai') + #Rename filtered bam so that it has the same name as the original + #This helps later when pipgen is looking for bams with certain expected names + os.rename(outbam, inbam) + #index filtered bam + bamindex = inbam + '.bai' + indexCMD = 'samtools index ' + inbam + index = subprocess.Popen(indexCMD, shell=True) + index.wait() + + filteredpct = round((filteredreadcount / readcount) * 100, 3) + + print('Looked through {0} reads. 
{1} ({2}%) had {3} or fewer alignments.'.format(
+        readcount, filteredreadcount, filteredpct, maxmap))
+
+
+def runDedup(samplename, nthreads):
+    STARbam = os.path.join(os.getcwd(), 'STAR', '{0}Aligned.sortedByCoord.out.bam'.format(samplename))
+    dedupbam = os.path.join(os.getcwd(), 'STAR', '{0}.dedup.bam'.format(samplename))
+    if args.libType == "LEXO":
+        command = ['umi_tools', 'dedup', '-I', STARbam, '--paired', '-S', dedupbam]
+    elif args.libType == "SA":
+        command = ['umi_tools', 'dedup', '-I', STARbam, '--paired', '--method=unique', '-S', dedupbam]
+    else:
+        #Exit here; otherwise 'command' would be undefined below.
+        print('libType must be either "LEXO" or "SA".')
+        sys.exit()
+
+    print('Running deduplication for {0}...'.format(samplename))
+
+    subprocess.run(command)
+
+    #We don't need the STAR alignment file anymore, and it's pretty big
+    #Rename to the old name so downstream code finds the bams it's looking for
+    os.rename(dedupbam, STARbam)
+    #Index deduplicated bam
+    bamindex = STARbam + '.bai'
+    indexCMD = 'samtools index ' + STARbam
+    index = subprocess.Popen(indexCMD, shell=True)
+    index.wait()
+
+    print('Finished deduplicating {0}!'.format(samplename))
+
+
+def bamtofastq(samplename, nthreads, dedup, reads2):
+    #Given a bam file of aligned reads (produced from runSTAR), rederive these reads as fastq in preparation for submission to salmon
+    if not os.path.exists('STAR'):
+        os.mkdir('STAR')
+
+    cwd = os.getcwd()
+    outdir = os.path.join(cwd, 'STAR')
+    inbam = os.path.join(outdir, samplename + 'Aligned.sortedByCoord.out.bam')
+    sortedbam = os.path.join(outdir, 'temp.namesort.bam')
+
+    #First sort bam file by readname
+    print('Sorting bam file by read name...')
+    command = ['samtools', 'collate', '--threads', nthreads, '-u', '-o', sortedbam, inbam]
+    subprocess.call(command)
+    print('Done!')
+
+    #Now derive fastq
+    r1file = samplename + '.aligned.r1.fq.gz'
+    r2file = samplename + '.aligned.r2.fq.gz'
+    if dedup:
+        print('Writing fastq file of
deduplicated reads for {0}...'.format(samplename)) + elif not dedup: + print('Writing fastq file of aligned reads for {0}...'.format(samplename)) + if reads2: + command = ['samtools', 'fastq', '--threads', nthreads, '-1', r1file, '-2', r2file, '-0', '/dev/null', '-s', '/dev/null', '-n', sortedbam] + elif not reads2: + command = ['samtools', 'fastq', '--threads', nthreads, '-0', r1file, '-n', sortedbam] + subprocess.call(command) + print('Done writing fastq files for {0}!'.format(samplename)) + + os.remove(sortedbam) + + +def runSalmon(reads1, reads2, nthreads, salmonindex, samplename): + #Take in those deduplicated reads and quantify transcript abundance with them using salmon. + + if not os.path.exists('salmon'): + os.mkdir('salmon') + + idx = os.path.abspath(salmonindex) + r1 = os.path.abspath(reads1) + r2 = os.path.abspath(reads2) + + os.chdir('salmon') + + command = ['salmon', 'quant', '--libType', 'A', '-p', nthreads, '--seqBias', '--gcBias', + '--validateMappings', '-1', r1, '-2', r2, '-o', samplename, '--index', idx, '--writeMappings={0}.salmon.bam'.format(samplename), '--writeQualities'] + + print('Running salmon for {0}...'.format(samplename)) + + subprocess.call(command) + + #Move output + outputdir = os.path.join(os.getcwd(), samplename) + quantfile = os.path.join(outputdir, 'quant.sf') + movedquantfile = os.path.join(os.getcwd(), '{0}.quant.sf'.format(samplename)) + os.rename(quantfile, movedquantfile) + + #Remove uniquely aligning read files + os.remove(r1) + if reads2: + os.remove(r2) + + print('Finished salmon for {0}!'.format(samplename)) + + +def runPostmaster(samplename, nthreads): + if not os.path.exists('postmaster'): + os.mkdir('postmaster') + + salmonquant = os.path.join(os.getcwd(), 'salmon', '{0}.quant.sf'.format(samplename)) + salmonbam = os.path.join(os.getcwd(), 'salmon', '{0}.salmon.bam'.format(samplename)) + + os.chdir('postmaster') + outputfile = os.path.join(os.getcwd(), '{0}.postmaster.bam'.format(samplename)) + + print('Running 
postmaster for {0}...'.format(samplename)) + command = ['postmaster', '--num-threads', nthreads, '--quant', salmonquant, '--alignments', salmonbam, '--output', outputfile] + subprocess.call(command) + + #Sort and index bam + with open(outputfile + '.sort', 'w') as sortedfh: + command = ['samtools', 'sort', '-@', nthreads, outputfile] + subprocess.run(command, stdout = sortedfh) + os.rename(outputfile + '.sort', outputfile) + + command = ['samtools', 'index', outputfile] + subprocess.run(command) + + #We don't need the salmon alignment file anymore, and it's pretty big + os.remove(salmonbam) + + print('Finished postmaster for {0}!'.format(samplename)) + + +def addMD(samplename, reffasta, nthreads): + inputbam = os.path.join(os.getcwd(), 'postmaster', '{0}.postmaster.bam'.format(samplename)) + command = ['samtools', 'calmd', '-b', '--threads', nthreads, inputbam, reffasta] + + print('Adding MD tags to {0}.postmaster.md.bam...'.format(samplename)) + with open(samplename + '.postmaster.md.bam', 'w') as outfile: + subprocess.run(command, stdout = outfile) + print('Finished adding MD tags to {0}.postmaster.md.bam!'.format(samplename)) + +if __name__ == '__main__': + parser = argparse.ArgumentParser(description = 'Align and quantify reads using STAR, salmon, and postmaster in preparation for analysis with PIGPEN.') + parser.add_argument('--forwardreads', type = str, help = 'Forward reads. Gzipped fastq.') + parser.add_argument('--reversereads', type = str, help = 'Reverse reads. Gzipped fastq.') + parser.add_argument('--nthreads', type = str, help = 'Number of threads to use for alignment and quantification.') + parser.add_argument('--STARindex', type = str, help = 'STAR index directory.') + parser.add_argument('--salmonindex', type = str, help = 'Salmon index directory.') + parser.add_argument('--samplename', type = str, help = 'Sample name. Will be appended to output files.') + parser.add_argument('--dedupUMI', action = 'store_true', help = 'Deduplicate UMIs? 
requires UMI extract.') + parser.add_argument('--libType', type = str, help = 'Library Type, either "LEXO" or "SA"') + parser.add_argument( + '--maxmap', type=int, help='Maximum number of allowable alignments for a read.') + args = parser.parse_args() + + r1 = os.path.abspath(args.forwardreads) + if args.reversereads: + r2 = os.path.abspath(args.reversereads) + elif not args.reversereads: + r2 = None + STARindex = os.path.abspath(args.STARindex) + salmonindex = os.path.abspath(args.salmonindex) + samplename = args.samplename + nthreads = args.nthreads + maxmap = args.maxmap + + wd = os.path.abspath(os.getcwd()) + sampledir = os.path.join(wd, samplename) + if os.path.exists(sampledir) and os.path.isdir(sampledir): + shutil.rmtree(sampledir) + os.mkdir(sampledir) + os.chdir(sampledir) + + + runSTAR(r1, r2, nthreads, STARindex, samplename) + filterbam(samplename, maxmap) + if args.dedupUMI: + runDedup(samplename, nthreads) + + #aligned read files + alignedr1 = samplename + '.aligned.r1.fq.gz' + if args.reversereads: + alignedr2 = samplename + '.aligned.r2.fq.gz' + elif not args.reversereads: + alignedr2 = None + bamtofastq(samplename, nthreads, args.dedupUMI, r2) + runSalmon(alignedr1, alignedr2, nthreads, salmonindex, samplename) + #Remove aligned fastqs + os.chdir(sampledir) + os.chdir(sampledir) + runPostmaster(samplename, nthreads) diff --git a/assignreads.py b/src/pigpen/assignreads.py similarity index 96% rename from assignreads.py rename to src/pigpen/assignreads.py index 3021ff6..d789828 100644 --- a/assignreads.py +++ b/src/pigpen/assignreads.py @@ -54,8 +54,9 @@ def processOverlaps(overlaps, numpairs): txs = overlaps[read] maxtx = max(txs, key = txs.get) overlaplength = txs[maxtx] #can implement minimum overlap here - gene = maxtx.split('_')[0] - read2gene[read] = gene + if overlaplength >= 80: + gene = maxtx.split('_')[0] + read2gene[read] = gene frac_readpairs_with_gene = round((len(read2gene) / numpairs) * 100, 2) print('Found genes for {0} read pairs 
({1}%).'.format(len(read2gene), frac_readpairs_with_gene)) diff --git a/src/pigpen/assignreads_salmon.py b/src/pigpen/assignreads_salmon.py new file mode 100644 index 0000000..906b211 --- /dev/null +++ b/src/pigpen/assignreads_salmon.py @@ -0,0 +1,293 @@ +import os +import sys +import pysam +import pickle +import gffutils +import numpy as np + + +#Take in a dictionary of {readid : conversions} (made by getmismatches.py) and a postmaster-enhanced bam (made by alignAndQuant.py). +#First, construct dictionary of {readid : {txid : fractional assignment}}. Then, combining this dictionary with the previous one, +#count the number of conversions associated with each transcript. Finally (and I guess optionally), using a genome annotation file, +#collapse transcript level conversion counts to gene-level conversion counts. + +def getpostmasterassignments(postmasterbam): + #Given a postmaster-produced bam, make a dictionary of the form {readid : {txid : fractional assignment}} + #It looks like in a postmaster bam that paired end reads are right after each other and are always + #given the same fractional assignments. This means we can probably just consider R1 reads. + + pprobs = {} #{readid : {txid : pprob}} + + with pysam.AlignmentFile(postmasterbam, 'r') as bamfh: + for read in bamfh.fetch(until_eof = True): + if read.is_read2: + continue + readid = read.query_name + tx = read.reference_name.split('.')[0] + pprob = read.get_tag(tag='ZW') + if readid not in pprobs: + pprobs[readid] = {} + pprobs[readid][tx] = pprob + + return pprobs + +def assigntotxs(pprobs, convs): + #Intersect posterior probabilities of read assignments to transcripts with conversion counts of those reads. + #The counts assigned to a tx by a read are scaled by the posterior probability that a read came from that transcript. 
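The fractional-assignment idea described in the comments above can be sketched in isolation. This is a toy re-implementation of the `assigntotxs()` logic, not the pipeline code itself; the read and transcript ids are invented:

```python
def scale_conversions(pprobs, convs):
    #Each read's conversion counts are split across its candidate transcripts,
    #weighted by the posterior assignment probability that salmon/postmaster
    #reported for that (read, transcript) pair.
    txconvs = {}  #{txid : {conversion : weighted count}}
    for readid, txprobs in pprobs.items():
        readconvs = convs.get(readid)
        if readconvs is None:
            continue  #no conversion record for this read
        for txid, pprob in txprobs.items():
            bucket = txconvs.setdefault(txid, {})
            for conv, count in readconvs.items():
                bucket[conv] = bucket.get(conv, 0.0) + count * pprob
    return txconvs

#One read split 70/30 between two hypothetical transcripts: its 2 g_t and
#10 g_g counts are distributed 1.4/0.6 and 7.0/3.0 respectively.
pprobs = {'read1': {'ENST01': 0.7, 'ENST02': 0.3}}
convs = {'read1': {'g_t': 2, 'g_g': 10}}
txconvs = scale_conversions(pprobs, convs)
```

Because the weights for a read sum to 1, total conversion counts are conserved across transcripts.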
+ + #pprobs = #{readid : {txid : pprob}} + #produced from getpostmasterassignments() + #convs = #{readid : {a_a : 200, a_t : 1, etc.}} + print('Finding transcript assignments for {0} reads.'.format(len(convs))) + readswithoutassignment = 0 #number of reads in pprobs for which we found no conversion record in convs + assignedreads = 0 #number of reads in convs for which we found a match in pprobs + + txconvs = {} # {txid : {a_a : 200, a_t : 1, etc.}} + + for readid in pprobs: + + try: + readconvs = convs[readid] + assignedreads += 1 + except KeyError: #we couldn't find this read in convs + readswithoutassignment += 1 + continue + + for txid in pprobs[readid]: + txid = txid.split('.')[0] + if txid not in txconvs: + txconvs[txid] = {} + pprob = pprobs[readid][txid] + for conv in readconvs: + scaledconv = readconvs[conv] * pprob + if conv not in txconvs[txid]: + txconvs[txid][conv] = scaledconv + else: + txconvs[txid][conv] += scaledconv + + readswithtxs = assignedreads + pct = round((readswithtxs / len(convs)) * 100, 2) + print('Found transcripts for {0} of {1} reads ({2}%).'.format(readswithtxs, len(convs), pct)) + + return txconvs + +def collapsetogene(txconvs, gff): + #Collapse tx-level count measurements to gene level. + #Need to relate transcripts and genes. Do that with the supplied gff annotation.
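The gene-level collapse described above reduces to summing transcript dictionaries through a tx-to-gene map. A minimal sketch (with made-up ids, and skipping unannotated transcripts the way the pipeline warns and continues):

```python
def collapse_to_gene(txconvs, tx2gene):
    #Sum transcript-level conversion counts into their parent gene,
    #skipping transcripts absent from the annotation-derived mapping.
    geneconvs = {}
    for tx, txc in txconvs.items():
        gene = tx2gene.get(tx)
        if gene is None:
            continue  #transcript not in the supplied annotation
        g = geneconvs.setdefault(gene, {})
        for conv, n in txc.items():
            g[conv] = g.get(conv, 0.0) + n
    return geneconvs

#Hypothetical two-transcript gene plus one unannotated transcript
tx2gene = {'ENST01': 'ENSG01', 'ENST02': 'ENSG01'}
txconvs = {'ENST01': {'g_t': 1.4}, 'ENST02': {'g_t': 0.6}, 'ENST99': {'g_t': 5.0}}
geneconvs = collapse_to_gene(txconvs, tx2gene)
```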
+ #txconvs = {txid : {a_a : 200, a_t : 1, etc.}} + + tx2gene = {} #{txid : geneid} + geneid2genename = {} #{geneid : genename} + geneconvs = {} # {geneid : {a_a : 200, a_t : 1, etc.}} + + print('Indexing gff..') + gff_fn = gff + db_fn = os.path.abspath(gff_fn) + '.db' + if os.path.isfile(db_fn) == False: + gffutils.create_db(gff_fn, db_fn, merge_strategy='merge', verbose=True) + print('Done indexing!') + + db = gffutils.FeatureDB(db_fn) + genes = db.features_of_type('gene') + + print('Connecting transcripts and genes...') + for gene in genes: + geneid = str(gene.id).split('.')[0] #remove version numbers + genename = gene.attributes['gene_name'][0] + geneid2genename[geneid] = genename + for tx in db.children(gene, featuretype = 'transcript'): + txid = str(tx.id).split('.')[0] + tx2gene[txid] = geneid + print('Done!') + + allgenes = list(set(tx2gene.values())) + + #Initialize geneconvs dictionary + possibleconvs = [ + 'a_a', 'a_t', 'a_c', 'a_g', 'a_n', + 'g_a', 'g_t', 'g_c', 'g_g', 'g_n', + 'c_a', 'c_t', 'c_c', 'c_g', 'c_n', + 't_a', 't_t', 't_c', 't_g', 't_n', + 'a_x', 'g_x', 'c_x', 't_x', 'ng_xg'] + + for gene in allgenes: + geneconvs[gene] = {} + for conv in possibleconvs: + geneconvs[gene][conv] = 0 + + for tx in txconvs: + try: + gene = tx2gene[tx] + except KeyError: + print('WARNING: transcript {0} doesn\'t belong to a gene in the supplied annotation.'.format(tx)) + continue + convs = txconvs[tx] + for conv in convs: + convcount = txconvs[tx][conv] + geneconvs[gene][conv] += convcount + + return tx2gene, geneid2genename, geneconvs + +def readspergene(quantsf, tx2gene): + #Get the number of reads assigned to each tx. This can simply be read from the salmon quant.sf file. + #Then, sum read counts across all transcripts within a gene. + #Transcript and gene relationships were derived by collapsetogene(). 
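The `readspergene()` parsing above relies on the salmon `quant.sf` column layout (`Name`, `Length`, `EffectiveLength`, `TPM`, `NumReads`, so `NumReads` is `line[4]`). A minimal stand-in shows the same parsing on an in-memory file; the transcript ids and numbers are invented:

```python
import csv
import io

#Two-row stand-in for a salmon quant.sf file
quant_sf = io.StringIO(
    'Name\tLength\tEffectiveLength\tTPM\tNumReads\n'
    'ENST01.4\t1000\t800.0\t5.0\t120.0\n'
    'ENST02.1\t900\t700.0\t2.0\t30.0\n')

txcounts = {}
for row in csv.DictReader(quant_sf, delimiter='\t'):
    txid = row['Name'].split('.')[0]  #strip any transcript version suffix
    txcounts[txid] = float(row['NumReads'])
```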
+ + txcounts = {} #{txid : readcounts} + genecounts = {} #{geneid : readcounts} + + with open(quantsf, 'r') as infh: + for line in infh: + line = line.strip().split('\t') + if line[0] == 'Name': + continue + txid = line[0].split('.')[0] #remove tx id version in the salmon quant.sf if it exists + counts = float(line[4]) + txcounts[txid] = counts + + allgenes = list(set(tx2gene.values())) + for gene in allgenes: + genecounts[gene] = 0 + + for txid in txcounts: + try: + geneid = tx2gene[txid] + except KeyError: #version numbers were already stripped from both dicts, so this transcript simply isn't in the annotation + continue + + genecounts[geneid] += txcounts[txid] + + return genecounts + + +def writeOutput(sampleparams, geneconvs, genecounts, geneid2genename, outfile, use_g_t, use_g_c, use_g_x, use_ng_xg): + #Write number of conversions and readcounts for genes. + possibleconvs = [ + 'a_a', 'a_t', 'a_c', 'a_g', 'a_n', + 'g_a', 'g_t', 'g_c', 'g_g', 'g_n', + 'c_a', 'c_t', 'c_c', 'c_g', 'c_n', + 't_a', 't_t', 't_c', 't_g', 't_n', + 'a_x', 'g_x', 'c_x', 't_x', 'ng_xg'] + + with open(outfile, 'w') as outfh: + #Write arguments for this pigpen run + for arg in sampleparams: + outfh.write('#' + arg + '\t' + str(sampleparams[arg]) + '\n') + #total G is number of ref Gs encountered + #convG is g_t + g_c + g_x + ng_xg (the ones we are interested in) + outfh.write(('\t').join(['GeneID', 'GeneName', 'numreads'] + possibleconvs + [ + 'totalG', 'convG', 'convGrate', 'G_Trate', 'G_Crate', 'G_Xrate', 'NG_XGrate', 'porc']) + '\n') + genes = sorted(geneconvs.keys()) + + for gene in genes: + genename = geneid2genename[gene] + numreads = genecounts[gene] + convcounts = [] + c = geneconvs[gene] + for conv in possibleconvs: + convcount = c[conv] + convcounts.append(convcount) + + convcounts = ['{:.2f}'.format(x) for x in convcounts] + + totalG = c['g_g'] + c['g_c'] + c['g_t'] + c['g_a'] + c['g_n'] + c['g_x'] + convG = 0 + possiblegconv = ['g_t', 'g_c', 'g_x', 'ng_xg'] + for ind, x in enumerate([use_g_t,
use_g_c, use_g_x, use_ng_xg]): + if x == True: + convG += c[possiblegconv[ind]] + + g_ccount = c['g_c'] + g_tcount = c['g_t'] + g_xcount = c['g_x'] + ng_xgcount = c['ng_xg'] + + totalmut = c['a_t'] + c['a_c'] + c['a_g'] + c['g_t'] + c['g_c'] + c['g_a'] + c['t_a'] + c['t_c'] + c['t_g'] + c['c_t'] + c['c_g'] + c['c_a'] + c['g_x'] + c['ng_xg'] + totalnonmut = c['a_a'] + c['g_g'] + c['c_c'] + c['t_t'] + allnt = totalmut + totalnonmut + + try: + convGrate = convG / totalG + except ZeroDivisionError: + convGrate = 'NA' + + try: + g_crate = g_ccount / totalG + except ZeroDivisionError: + g_crate = 'NA' + + try: + g_trate = g_tcount / totalG + except ZeroDivisionError: + g_trate = 'NA' + + try: + g_xrate = g_xcount / totalG + except ZeroDivisionError: + g_xrate = 'NA' + + try: + ng_xgrate = ng_xgcount / totalG + except ZeroDivisionError: + ng_xgrate = 'NA' + + try: + totalmutrate = totalmut / allnt + except ZeroDivisionError: + totalmutrate = 'NA' + + #normalize convGrate to rate of all mutations + #Proportion Of Relevant Conversions + if totalmutrate == 'NA': + porc = 'NA' + elif totalmutrate > 0: + try: + porc = np.log2(convGrate / totalmutrate) + except: + porc = 'NA' + else: + porc = 'NA' + + #Format numbers for printing + if type(numreads) == float: + numreads = '{:.2f}'.format(numreads) + if type(convG) == float: + convG = '{:.2f}'.format(convG) + if type(totalG) == float: + totalG = '{:.2f}'.format(totalG) + if type(convGrate) == float: + convGrate = '{:.2e}'.format(convGrate) + if type(g_trate) == float: + g_trate = '{:.2e}'.format(g_trate) + if type(g_crate) == float: + g_crate = '{:.2e}'.format(g_crate) + if type(g_xrate) == float: + g_xrate = '{:.2e}'.format(g_xrate) + if type(ng_xgrate) == float: + ng_xgrate = '{:.2e}'.format(ng_xgrate) + if type(porc) == np.float64: + porc = '{:.3f}'.format(porc) + + outfh.write(('\t').join([gene, genename, str(numreads)] + convcounts + [str(totalG), str(convG), str(convGrate), str(g_trate), str(g_crate), str(g_xrate), 
str(ng_xgrate), str(porc)]) + '\n') + + + + + +if __name__ == '__main__': + print('Getting posterior probabilities from salmon alignment file...') + pprobs = getpostmasterassignments(sys.argv[1]) + print('Done!') + print('Loading conversions from pickle file...') + with open(sys.argv[2], 'rb') as infh: + convs = pickle.load(infh) + print('Done!') + print('Assigning conversions to transcripts...') + txconvs = assigntotxs(pprobs, convs) + print('Done!') + + tx2gene, geneid2genename, geneconvs = collapsetogene(txconvs, sys.argv[3]) + genecounts = readspergene(sys.argv[4], tx2gene) + writeOutput({}, geneconvs, genecounts, geneid2genename, sys.argv[5], True, True, False, False) #no run params; count g_t and g_c conversions \ No newline at end of file diff --git a/src/pigpen/assignreads_salmon_ensembl.py b/src/pigpen/assignreads_salmon_ensembl.py new file mode 100644 index 0000000..dd70967 --- /dev/null +++ b/src/pigpen/assignreads_salmon_ensembl.py @@ -0,0 +1,296 @@ +import os +import sys +import pysam +import pickle +import gffutils +import numpy as np + + +#Take in a dictionary of {readid : conversions} (made by getmismatches.py) and a postmaster-enhanced bam (made by alignAndQuant.py). +#First, construct dictionary of {readid : {txid : fractional assignment}}. Then, combining this dictionary with the previous one, +#count the number of conversions associated with each transcript. Finally (and I guess optionally), using a genome annotation file, +#collapse transcript level conversion counts to gene-level conversion counts. + +def getpostmasterassignments(postmasterbam): + #Given a postmaster-produced bam, make a dictionary of the form {readid : {txid : fractional assignment}} + #It looks like in a postmaster bam that paired end reads are right after each other and are always + #given the same fractional assignments. This means we can probably just consider R1 reads.
+ + pprobs = {} #{readid : {txid : pprob}} + + with pysam.AlignmentFile(postmasterbam, 'r') as bamfh: + for read in bamfh.fetch(until_eof = True): + if read.is_read2: + continue + readid = read.query_name + tx = read.reference_name.split('.')[0] + pprob = read.get_tag(tag='ZW') + if readid not in pprobs: + pprobs[readid] = {} + pprobs[readid][tx] = pprob + + return pprobs + +def assigntotxs(pprobs, convs): + #Intersect posterior probabilities of read assignments to transcripts with conversion counts of those reads. + #The counts assigned to a tx by a read are scaled by the posterior probability that a read came from that transcript. + + #pprobs = #{readid : {txid : pprob}} + #produced from getpostmasterassignments() + #convs = #{readid : {a_a : 200, a_t : 1, etc.}} + print('Finding transcript assignments for {0} reads.'.format(len(convs))) + readswithoutassignment = 0 #number of reads which exist in convs but not in pprobs (i.e. weren't assigned to a transcript by salmon) + assignedreads = 0 #number of reads in convs for which we found a match in pprobs + + txconvs = {} # {txid : {a_a : 200, a_t : 1, etc.}} + + for readid in pprobs: + + try: + readconvs = convs[readid] + assignedreads +=1 + except KeyError: #we couldn't find this read in convs + readswithoutassignment +=1 + continue + + for txid in pprobs[readid]: + txid = txid.split('.')[0] + if txid not in txconvs: + txconvs[txid] = {} + pprob = pprobs[readid][txid] + for conv in readconvs: + scaledconv = readconvs[conv] * pprob + if conv not in txconvs[txid]: + txconvs[txid][conv] = scaledconv + else: + txconvs[txid][conv] += scaledconv + + readswithtxs = len(convs) - readswithoutassignment + readswithtxs = assignedreads + pct = round(readswithtxs / len(convs), 2) * 100 + print('Found transcripts for {0} of {1} reads ({2}%).'.format(readswithtxs, len(convs), pct)) + + return txconvs + +def collapsetogene(txconvs, gff): + #Collapse tx-level count measurements to gene level. 
+ #Need to relate transcripts and genes. Do that with the supplied gff annotation. + #txconvs = {txid : {a_a : 200, a_t : 1, etc.}} + + tx2gene = {} #{txid : geneid} + geneid2genename = {} #{geneid : genename} + geneconvs = {} # {geneid : {a_a : 200, a_t : 1, etc.}} + + print('Indexing gff..') + gff_fn = gff + db_fn = os.path.abspath(gff_fn) + '.db' + if os.path.isfile(db_fn) == False: + gffutils.create_db(gff_fn, db_fn, merge_strategy='merge', verbose=True) + print('Done indexing!') + + db = gffutils.FeatureDB(db_fn) + #Very annoyingly, in some annotations, the third field for gene entries is not necessarily 'gene' + #in all ensembl annotations. For example, at least in an ensembl E. coli annotation, it can be 'ncRNA_gene' + #as well. Probably other things too. + genes = db.features_of_type(('gene', 'ncRNA_gene')) + + print('Connecting transcripts and genes...') + for gene in genes: + geneid = str(gene.id).split('.')[0].replace('gene:', '') #remove version numbers and gene: + #in the ensembl zebrafish gff, some genes don't have a Name attribute + try: + genename = gene.attributes['Name'][0] + except KeyError: + genename = geneid + geneid2genename[geneid] = genename + for tx in db.children(gene, level = 1): #allow feature types other than 'mRNA'. be flexible here. 
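The Ensembl-specific id cleanup above (stripping `gene:`/`transcript:` prefixes and version suffixes so annotation ids match salmon ids) can be isolated into one helper. A sketch with example accessions:

```python
def normalize_id(feature_id):
    #Ensembl GFF3 feature ids carry a type prefix and often a version,
    #e.g. 'gene:ENSG00000157764.14' or 'transcript:ENST00000646891.2'.
    #Keep only the bare accession so quant.sf ids and annotation ids match.
    bare = feature_id.split(':', 1)[-1]
    return bare.split('.')[0]
```

Ids without a prefix or version pass through unchanged, so the helper is safe to apply everywhere.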
+ txid = str(tx.id).split('.')[0].replace('transcript:', '') #remove version numbers and transcript: + tx2gene[txid] = geneid + print('Done!') + + allgenes = list(set(tx2gene.values())) + + #Initialize geneconvs dictionary + possibleconvs = [ + 'a_a', 'a_t', 'a_c', 'a_g', 'a_n', + 'g_a', 'g_t', 'g_c', 'g_g', 'g_n', + 'c_a', 'c_t', 'c_c', 'c_g', 'c_n', + 't_a', 't_t', 't_c', 't_g', 't_n', + 'a_x', 'g_x', 'c_x', 't_x', 'ng_xg'] + + for gene in allgenes: + geneconvs[gene] = {} + for conv in possibleconvs: + geneconvs[gene][conv] = 0 + + for tx in txconvs: + try: + gene = tx2gene[tx] + except KeyError: + print('WARNING: transcript {0} doesn\'t belong to a gene in the supplied annotation.'.format(tx)) + continue + convs = txconvs[tx] + for conv in convs: + convcount = txconvs[tx][conv] + geneconvs[gene][conv] += convcount + + return tx2gene, geneid2genename, geneconvs + +def readspergene(quantsf, tx2gene): + #Get the number of reads assigned to each tx. This can simply be read from the salmon quant.sf file. + #Then, sum read counts across all transcripts within a gene. + #Transcript and gene relationships were derived by collapsetogene(). 
+ + txcounts = {} #{txid : readcounts} + genecounts = {} #{geneid : readcounts} + + with open(quantsf, 'r') as infh: + for line in infh: + line = line.strip().split('\t') + if line[0] == 'Name': + continue + txid = line[0].split('.')[0] #remove tx id version in the salmon quant.sf if it exists + counts = float(line[4]) + txcounts[txid] = counts + + allgenes = list(set(tx2gene.values())) + for gene in allgenes: + genecounts[gene] = 0 + + for txid in txcounts: + try: + geneid = tx2gene[txid] + except KeyError: #version numbers were already stripped from both dicts, so this transcript simply isn't in the annotation + continue + + genecounts[geneid] += txcounts[txid] + + return genecounts + + +def writeOutput(sampleparams, geneconvs, genecounts, geneid2genename, outfile, use_g_t, use_g_c, use_g_x, use_ng_xg): + #Write number of conversions and readcounts for genes. + possibleconvs = [ + 'a_a', 'a_t', 'a_c', 'a_g', 'a_n', + 'g_a', 'g_t', 'g_c', 'g_g', 'g_n', + 'c_a', 'c_t', 'c_c', 'c_g', 'c_n', + 't_a', 't_t', 't_c', 't_g', 't_n', + 'a_x', 'g_x', 'c_x', 't_x', 'ng_xg'] + + with open(outfile, 'w') as outfh: + #Write arguments for this pigpen run + for arg in sampleparams: + outfh.write('#' + arg + '\t' + str(sampleparams[arg]) + '\n') + #total G is number of ref Gs encountered + #convG is g_t + g_c + g_x + ng_xg (the ones we are interested in) + outfh.write(('\t').join(['GeneID', 'GeneName', 'numreads'] + possibleconvs + [ + 'totalG', 'convG', 'convGrate', 'G_Trate', 'G_Crate', 'G_Xrate', 'NG_XGrate', 'porc']) + '\n') + genes = sorted(geneconvs.keys()) + + for gene in genes: + genename = geneid2genename[gene] + numreads = genecounts[gene] + convcounts = [] + c = geneconvs[gene] + for conv in possibleconvs: + convcount = c[conv] + convcounts.append(convcount) + + convcounts = ['{:.2f}'.format(x) for x in convcounts] + + totalG = c['g_g'] + c['g_c'] + c['g_t'] + c['g_a'] + c['g_n'] + c['g_x'] + convG = 0 + possiblegconv = ['g_t', 'g_c', 'g_x', 'ng_xg'] + for ind, x in enumerate([use_g_t,
use_g_c, use_g_x, use_ng_xg]): + if x == True: + convG += c[possiblegconv[ind]] + + g_ccount = c['g_c'] + g_tcount = c['g_t'] + g_xcount = c['g_x'] + ng_xgcount = c['ng_xg'] + + totalmut = c['a_t'] + c['a_c'] + c['a_g'] + c['g_t'] + c['g_c'] + c['g_a'] + c['t_a'] + c['t_c'] + c['t_g'] + c['c_t'] + c['c_g'] + c['c_a'] + c['g_x'] + c['ng_xg'] + totalnonmut = c['a_a'] + c['g_g'] + c['c_c'] + c['t_t'] + allnt = totalmut + totalnonmut + + try: + convGrate = convG / totalG + except ZeroDivisionError: + convGrate = 'NA' + + try: + g_crate = g_ccount / totalG + except ZeroDivisionError: + g_crate = 'NA' + + try: + g_trate = g_tcount / totalG + except ZeroDivisionError: + g_trate = 'NA' + + try: + g_xrate = g_xcount / totalG + except ZeroDivisionError: + g_xrate = 'NA' + + try: + ng_xgrate = ng_xgcount / totalG + except ZeroDivisionError: + ng_xgrate = 'NA' + + try: + totalmutrate = totalmut / allnt + except ZeroDivisionError: + totalmutrate = 'NA' + + #normalize convGrate to rate of all mutations + #Proportion Of Relevant Conversions + if totalmutrate == 'NA': + porc = 'NA' + elif totalmutrate > 0: + try: + porc = np.log2(convGrate / totalmutrate) + except: + porc = 'NA' + else: + porc = 'NA' + + #Format numbers for printing + if type(numreads) == float: + numreads = '{:.2f}'.format(numreads) + if type(convG) == float: + convG = '{:.2f}'.format(convG) + if type(totalG) == float: + totalG = '{:.2f}'.format(totalG) + if type(convGrate) == float: + convGrate = '{:.2e}'.format(convGrate) + if type(g_trate) == float: + g_trate = '{:.2e}'.format(g_trate) + if type(g_crate) == float: + g_crate = '{:.2e}'.format(g_crate) + if type(g_xrate) == float: + g_xrate = '{:.2e}'.format(g_xrate) + if type(ng_xgrate) == float: + ng_xgrate = '{:.2e}'.format(ng_xgrate) + if type(porc) == np.float64: + porc = '{:.3f}'.format(porc) + + outfh.write(('\t').join([gene, genename, str(numreads)] + convcounts + [str(totalG), str(convG), str(convGrate), str(g_trate), str(g_crate), str(g_xrate), 
str(ng_xgrate), str(porc)]) + '\n') + +if __name__ == '__main__': + print('Getting posterior probabilities from salmon alignment file...') + pprobs = getpostmasterassignments(sys.argv[1]) + print('Done!') + print('Loading conversions from pickle file...') + with open(sys.argv[2], 'rb') as infh: + convs = pickle.load(infh) + print('Done!') + print('Assigning conversions to transcripts...') + txconvs = assigntotxs(pprobs, convs) + print('Done!') + + tx2gene, geneid2genename, geneconvs = collapsetogene(txconvs, sys.argv[3]) + genecounts = readspergene(sys.argv[4], tx2gene) + writeOutput({}, geneconvs, genecounts, geneid2genename, sys.argv[5], True, True, False, False) #no run params; count g_t and g_c conversions \ No newline at end of file diff --git a/src/pigpen/bacon_glm.py b/src/pigpen/bacon_glm.py new file mode 100644 index 0000000..467a3e1 --- /dev/null +++ b/src/pigpen/bacon_glm.py @@ -0,0 +1,417 @@ +#Bioinformatic Analysis of the Conversion Of Nucleotides (BACON) + +#Use a linear model to identify genes whose rate of G-conversions changes across conditions +#Going to end up with a lot of contingency tables: + +# converted notConverted +#G +#nonG + +#G conversions are only allowed to be G->T and G->C +#Compare groups of contingency tables across conditions +import pandas as pd +import sys +import numpy as np +import itertools +from statsmodels.stats.multitest import multipletests +from rpy2.robjects.packages import importr +from rpy2.robjects import pandas2ri, Formula, FloatVector +from rpy2.rinterface_lib.embedded import RRuntimeError +from rpy2.rinterface import RRuntimeWarning +from rpy2.rinterface_lib.callbacks import logger as rpy2_logger +import logging +import warnings +import argparse +utils = importr('utils') +lme4 = importr('lme4') +base = importr('base') +stats = importr('stats') + +parser = argparse.ArgumentParser(description = 'BACON: A framework for analyzing pigpen outputs.') +parser.add_argument('--sampconds', type=str, + help='3 column, tab delimited file.
Column names must be \'file\', \'sample\', and \'condition\'. See README for more details.') +parser.add_argument('--minreads', type = int, help = 'Minimum read count for a gene to be considered in a sample.', default = 100) +parser.add_argument('--conditionA', type=str, + help='One of the two conditions in the \'condition\' column of sampconds. Deltaporc is defined as conditionB - conditionA.') +parser.add_argument('--conditionB', type=str, + help='One of the two conditions in the \'condition\' column of sampconds. Deltaporc is defined as conditionB - conditionA.') +parser.add_argument('--use_g_t', help = 'Consider G to T mutations in contingency table?', action = 'store_true') +parser.add_argument('--use_g_c', help = 'Consider G to C mutations in contingency table?', action = 'store_true') +parser.add_argument('--use_g_x', help = 'Consider G to deletion mutations in contingency table?', action = 'store_true') +parser.add_argument('--use_ng_xg', help = 'Consider NG to deletionG mutations in contingency table?', action = 'store_true') +parser.add_argument('--considernonG', + help='Consider conversions of nonG residues to normalize for overall mutation rate?', action = 'store_true') +parser.add_argument('--output', type = str, help = 'Output file.') +args = parser.parse_args() + +#Need r-base, r-stats, r-lme4 + +#supress RRuntimeWarnings +warnings.filterwarnings('ignore', category = RRuntimeWarning) +rpy2_logger.setLevel(logging.ERROR) #suppresses R messages to console + + + +def readconditions(samp_conds_file): + #Read in a three column file of oincoutputfile / sample / condition + #Return a pandas df + sampconddf = pd.read_csv(samp_conds_file, sep = '\t', index_col = False, header = 0) + + return sampconddf + +def makePORCdf(samp_conds_file, minreads, considernonG): + #Make a dataframe of PORC values for all samples + #start with GENE...SAMPLE...READCOUNT...PORC + #then make a wide version that is 
GENE...SAMPLE1READCOUNT...SAMPLE1PORC...SAMPLE2READCOUNT...SAMPLE2PORC + + #Only keep genes that are present in EVERY pigpen file + + minreads = int(minreads) + dfs = [] #list of dfs from individual pigpen outputs + genesinall = set() #genes present in every pigpen file + with open(samp_conds_file, 'r') as infh: + for line in infh: + line = line.strip().split('\t') + if line[0] == 'file': + continue + pigpenfile = line[0] + sample = line[1] + df = pd.read_csv(pigpenfile, sep = '\t', index_col = False, comment = '#', header = 0) + dfgenes = df['GeneID'].tolist() + samplecolumn = [sample] * len(dfgenes) + df = df.assign(sample = samplecolumn) + + if not genesinall: #if there are no genes in there (this is the first file) + genesinall = set(dfgenes) + else: + genesinall = genesinall.intersection(set(dfgenes)) + + columnstokeep = ['GeneID', 'GeneName', 'sample', 'numreads', 'G_Trate', 'G_Crate', 'G_Xrate', 'NG_XGrate', 'convGrate', 'porc'] + df = df[columnstokeep] + dfs.append(df) + + #for each df, filter keeping only the genes present in every df (genesinall) + #Somehow there are some genes whose name in NA + if np.nan in genesinall: + genesinall.remove(np.nan) + dfs = [df.loc[df['GeneID'].isin(genesinall)] for df in dfs] + + #concatenate (rbind) dfs together + df = pd.concat(dfs) + + #turn from long into wide + df = df.pivot_table(index=['GeneID', 'GeneName'], columns='sample', values=[ + 'numreads', 'G_Trate', 'G_Crate', 'G_Xrate', 'NG_XGrate', 'convGrate', 'porc']).reset_index() + #flatten multiindex column names + df.columns = ["_".join(a) if '' not in a else a[0] for a in df.columns.to_flat_index()] + + #Filter for genes with at least minreads in every sample + #get columns with numreads info + numreadsColumns = [col for col in df.columns if 'numreads' in col] + #Get minimum in those columns + df['minreadcount'] = df[numreadsColumns].min(axis = 1) + #Filter for rows with minreadcount >= minreads + print('Filtering for genes with at least {0} reads in every 
sample.'.format(minreads)) + df = df.loc[df['minreadcount'] >= minreads] + print('{0} genes have at least {1} reads in every sample.'.format(len(df), minreads)) + #We also don't want rows with inf/-inf PORC values + #This is true only if we are using porc values, otherwise we can keep them + if considernonG: + df = df.replace([np.inf, -np.inf], np.nan) + df = df.dropna(how = 'any') + #Return a dataframe of just genes and relevant values + columnstokeep = ['GeneID', 'GeneName'] + [col for col in df.columns if 'rate' in col] + [col for col in df.columns if 'porc' in col] + df = df[columnstokeep] + print('{0} genes pass read count filter in all files.'.format(len(df))) + + return df + +def calcDeltaPORC(porcdf, sampconds, conditionA, conditionB, metric): + #Given a porc df from makePORCdf, add delta metric values. + #Metric depends on whether we are considering nonG conversions (if so, metric = porc) + #and on what conversion we are considering (g_t, g_c, or if both, metric = convGrate) + + deltametrics = [] + sampconddf = readconditions(sampconds) + + #Get column names in porcdf that are associated with each condition + conditionAsamps = sampconddf.loc[sampconddf['condition'] == conditionA] + conditionAsamps = conditionAsamps['sample'].tolist() + conditionAcolumns = [metric + '_' + samp for samp in conditionAsamps] + + conditionBsamps = sampconddf.loc[sampconddf['condition'] == conditionB] + conditionBsamps = conditionBsamps['sample'].tolist() + conditionBcolumns = [metric + '_' + samp for samp in conditionBsamps] + + print('Condition A samples: ' + (', ').join(conditionAsamps)) + print('Condition B samples: ' + (', ').join(conditionBsamps)) + + for index, row in porcdf.iterrows(): + condAmetrics = [] + condBmetrics = [] + for col in conditionAcolumns: + value = row[col] + condAmetrics.append(value) + for col in conditionBcolumns: + value = row[col] + condBmetrics.append(value) + + condAmetrics = [x for x in condAmetrics if not np.isnan(x)] + condBmetrics = [x for x in condBmetrics if not np.isnan(x)] + condAmetric = np.mean(condAmetrics) + condBmetric = np.mean(condBmetrics) + + deltametric = condBmetric - condAmetric #remember that porc is logged, but the raw conversion rates are not + if metric == 'porc': + deltametric = float(format(deltametric, '.3f')) + else: + deltametric = '{:.3e}'.format(deltametric) + deltametrics.append(deltametric) + + if metric == 'porc': + porcdf = porcdf.assign(delta_porc = deltametrics) + elif metric == 'convGrate': + porcdf = porcdf.assign(delta_convGrate = deltametrics) + + #Only keep the relevant columns + columnstokeep = ['GeneID', 'GeneName'] + [col for col in porcdf.columns if metric in col] + porcdf = porcdf[columnstokeep] + + return porcdf + +def makeContingencyTable(row, use_g_t, use_g_c, use_g_x, use_ng_xg): + #Given a row from a pigpen df, return a contingency table of the form + #[[convG, nonconvG], [convnonG, nonconvnonG]] + + #Identity of converted Gs depends on which ones you want to count + possibleconvs = ['g_t', 'g_c', 'g_x', 'ng_xg'] + possibleoptions = [use_g_t, use_g_c, use_g_x, use_ng_xg] + selectedconvs = [] + for ind, option in enumerate(possibleoptions): + if option == True: + selectedconvs.append(possibleconvs[ind]) + + convG = 0 + for conv in selectedconvs: + convG += row[conv] + + nonconvG = row['g_g'] + convnonG = row['a_t'] + row['a_c'] + row['a_g'] + row['a_x'] + row['c_a'] + row['c_t'] + row['c_g'] + row['c_x'] + row['t_a'] + row['t_c'] + row['t_g'] + row['t_x'] + nonconvnonG = row['c_c'] + row['t_t'] + row['a_a'] + + conttable = [[convG, nonconvG], [convnonG, nonconvnonG]] + + return conttable + + +def calculate_nested_f_statistic(small_model, big_model): + #Given two fitted GLMs, the larger of which contains the parameter space of the smaller, return the F Stat and P value corresponding to the larger model adding explanatory power + #No anova test for GLM models in python + #this is a workaround + #https://stackoverflow.com/questions/27328623/anova-test-for-glm-in-python
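The workaround's arithmetic can be checked in isolation with plain numbers. This sketch mirrors the F-statistic computation only (the deviance, scale, and df values are made up, and the final survival-function lookup against an F distribution is left out):

```python
def nested_f(dev_small, df_small, dev_big, df_big, scale, n_obs):
    #F = (deviance_small - deviance_big) / (extra_params * scale),
    #with numerator df = extra_params and denominator df = n_obs - df_big,
    #as in calculate_nested_f_statistic().
    extra_params = df_big - df_small
    f_stat = (dev_small - dev_big) / (extra_params * scale)
    return f_stat, extra_params, n_obs - df_big

f_stat, df_num, df_den = nested_f(120.0, 3, 100.0, 5, 2.0, 50)
```

A deviance drop of 20 over 2 extra parameters at scale 2 gives F = 5 on (2, 45) degrees of freedom.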
addtl_params = big_model.df_model - small_model.df_model + f_stat = (small_model.deviance - big_model.deviance) / \ + (addtl_params * big_model.scale) + df_numerator = addtl_params + # use fitted values to obtain n_obs from model object: + df_denom = (big_model.fittedvalues.shape[0] - big_model.df_model) + p_value = stats.f.sf(f_stat, df_numerator, df_denom) + return (f_stat, p_value) + +def getgenep(geneconttable, considernonG): + #Given a gene-level contingency table of the form: + #[condAtables, condBtables], where each individual sample is of the form + #[[convG, nonconvG], [convnonG, nonconvnonG]], + #run lme either including or excluding condition term + #using likelihood ratio of the two models and chisq test, return p value + + #Turn gene-level contingency tables into df of form + #convcount nonconvcount condition nuc sample + nCondAsamples = len(geneconttable[0]) + nCondBsamples = len(geneconttable[1]) + + # e.g. ['condA', 'condA', 'condB', 'condB'] + cond = ['condA'] * (nCondAsamples * 4) + ['condB'] * (nCondBsamples * 4) + nuc = ['G', 'G', 'nonG', 'nonG'] * nCondAsamples + ['G', 'G', 'nonG', + 'nonG'] * nCondBsamples # e.g. ['G', 'G', 'nonG', 'nonG', ...] + conv = ['yes', 'no', 'yes', 'no'] * nCondAsamples + ['yes', 'no', + 'yes', 'no'] * nCondBsamples # e.g. ['yes', 'no', 'yes', 'no', ...] 
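The list construction above is easiest to see with concrete sizes. This rebuilds the long-format design columns for 2 condition-A and 2 condition-B samples, following the same pattern as `getgenep()` (a standalone illustration, not the pipeline code):

```python
import itertools

nCondAsamples, nCondBsamples = 2, 2
n = nCondAsamples + nCondBsamples

#Four rows per sample: (G, yes), (G, no), (nonG, yes), (nonG, no)
cond = ['condA'] * (nCondAsamples * 4) + ['condB'] * (nCondBsamples * 4)
nuc = ['G', 'G', 'nonG', 'nonG'] * n
conv = ['yes', 'no', 'yes', 'no'] * n
samples = sorted(itertools.chain.from_iterable(
    [['sample' + str(i + 1) for i in range(n)]] * 4))
```

All four columns end up the same length (16 here), with each sample's four contingency-cell rows grouped together.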
+ samples = ['sample' + str(x + 1) for x in range(nCondAsamples + nCondBsamples)] + samples = list(itertools.repeat(samples, 4)) + samples = list(itertools.chain.from_iterable(samples)) + samples = sorted(samples) + samplenumber = [x + 1 for x in range(len(cond))] + + #to get counts, flatten the very nested list of lists that is geneconttable + a = list(itertools.chain.from_iterable(geneconttable)) + b = list(itertools.chain.from_iterable(a)) + counts = list(itertools.chain.from_iterable(b)) + + d = {'cond': cond, 'nuc': nuc, 'conv': conv, + 'counts': counts, 'sample' : samples} + df = pd.DataFrame.from_dict(d) + + if considernonG: + #Reshape table to get individual columns for converted and nonconverted nts + df2 = df.pivot_table(index=['cond', 'nuc', 'sample'], + columns='conv', values='counts').reset_index() + + pandas2ri.activate() + + fmla = 'cbind(yes, no) ~ sample + nuc + cond + nuc:cond' + nullfmla = 'cbind(yes, no) ~ sample + nuc' + + fullfit = stats.glm(formula=fmla, family=stats.binomial, data=df2) + reducedfit = stats.glm(formula=nullfmla, family=stats.binomial, data=df2) + + logratio = (stats.logLik(fullfit)[0] - stats.logLik(reducedfit)[0]) * 2 + pvalue = stats.pchisq(logratio, df=2, lower_tail=False)[0] + #format decimal + pvalue = float('{:.2e}'.format(pvalue)) + + return pvalue + + elif not considernonG: + #Remove rows in which nuc == nonG + df = df[df.nuc == 'G'] + + #Reshape table to get individual columns for converted and nonconverted nts + df2 = df.pivot_table(index=['cond', 'nuc', 'sample'], + columns='conv', values='counts').reset_index() + + pandas2ri.activate() + fmla = 'cbind(yes, no) ~ cond' + nullfmla = 'cbind(yes, no) ~ 1' + + fullfit = stats.glm(formula=fmla, family=stats.binomial, data=df2) + reducedfit = stats.glm(formula=nullfmla, family=stats.binomial, data=df2) + + logratio = (stats.logLik(fullfit)[0] - stats.logLik(reducedfit)[0]) * 2 + pvalue = stats.pchisq(logratio, df=1, lower_tail=False)[0] + #format decimal + pvalue = 
float('{:.2e}'.format(pvalue)) + + return pvalue + +def multihyp(pvalues): + #given a dictionary of {gene : pvalue}, perform multiple hypothesis correction + + #remove genes with p value of NA + cleanedp = {} + for gene in pvalues: + if not np.isnan(pvalues[gene]): + cleanedp[gene] = pvalues[gene] + + cleanedps = list(cleanedp.values()) + fdrs = multipletests(cleanedps, method = 'fdr_bh')[1] + fdrs = dict(zip(list(cleanedp.keys()), fdrs)) + + correctedps = {} + for gene in pvalues: + if np.isnan(pvalues[gene]): + correctedps[gene] = np.nan + else: + correctedps[gene] = fdrs[gene] + + return correctedps + + +def getpvalues(samp_conds_file, conditionA, conditionB, considernonG, filteredgenes, use_g_t, use_g_c, use_g_x, use_ng_xg): + #each contingency table will be: [[convG, nonconvG], [convnonG, nonconvnonG]] + #These will be stored in a dictionary: {gene : [condAtables, condBtables]} + conttables = {} + + pvalues = {} #{gene : pvalue} + + nsamples = 0 + with open(samp_conds_file, 'r') as infh: + for line in infh: + line = line.strip().split('\t') + if line[0] == 'file': + continue + nsamples +=1 + pigpenfile = line[0] + sample = line[1] + condition = line[2] + df = pd.read_csv(pigpenfile, sep = '\t', index_col = False, header=0, comment = '#') + for idx, row in df.iterrows(): + conttable = makeContingencyTable(row, use_g_t, use_g_c, use_g_x, use_ng_xg) + gene = row['GeneID'] + #If this isn't one of the genes that passes read count filters in all files, skip it + if gene not in filteredgenes: + continue + if gene not in conttables: + conttables[gene] = [[], []] + if condition == conditionA: + conttables[gene][0].append(conttable) + elif condition == conditionB: + conttables[gene][1].append(conttable) + + genecounter = 0 + for gene in conttables: + genecounter +=1 + if genecounter % 1000 == 0: + print('Getting p value for gene {0}...'.format(genecounter)) + geneconttable = conttables[gene] + #Only calculate p values for genes present in every porc file + 
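The BH correction in multihyp (done via statsmodels' `multipletests`) can be reproduced by hand; a minimal numpy sketch with hypothetical p values, dropping NaNs before correction and reinserting them afterwards exactly as multihyp does:

```python
import numpy as np

def bh_adjust(pvals):
    # Benjamini-Hochberg: p * m / rank, then enforce monotonicity
    # from the largest p value downward
    p = np.asarray(pvals, dtype=float)
    m = len(p)
    order = np.argsort(p)
    adj = p[order] * m / np.arange(1, m + 1)
    adj = np.minimum.accumulate(adj[::-1])[::-1]
    out = np.empty(m)
    out[order] = np.minimum(adj, 1.0)
    return out

# hypothetical gene-level p values, one NA
pvalues = {'geneA': 0.01, 'geneB': 0.02, 'geneC': np.nan, 'geneD': 0.04}
clean = {g: p for g, p in pvalues.items() if not np.isnan(p)}
fdrs = dict(zip(clean, bh_adjust(list(clean.values()))))
correctedps = {g: fdrs.get(g, np.nan) for g in pvalues}
# e.g. geneA: 0.01 * 3/1 = 0.03; geneC stays NaN
```

Note that m is the number of non-NA tests (3 here, not 4), which is why the NA genes must be removed before correction rather than after.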
gene_porcfiles = len(geneconttable[0]) + len(geneconttable[1]) + if nsamples == gene_porcfiles: + try: + p = getgenep(geneconttable, considernonG) + except RRuntimeError: + p = np.nan + else: + p = np.nan + + pvalues[gene] = p + + correctedps = multihyp(pvalues) + + pdf = pd.DataFrame.from_dict(pvalues, orient = 'index', columns = ['pval']) + fdrdf = pd.DataFrame.from_dict(correctedps, orient = 'index', columns = ['FDR']) + + pdf = pd.merge(pdf, fdrdf, left_index = True, right_index = True).reset_index().rename(columns = {'index' : 'GeneID'}) + + return pdf + +def formatporcdf(porcdf): + #Format floats in all porcDF columns + formats = {'pval': '{:.3e}', 'FDR': '{:.3e}'} + #c = porcdf.columns.tolist() + #c = [x for x in c if 'porc_' in x] + #for x in c: + # formats[x] = '{:.3f}' #all porc_SAMPLE columns + for col, f in formats.items(): + porcdf[col] = porcdf[col].map(lambda x: f.format(x)) + + return porcdf + +def main(): + #Considering nonG conversions uses porc values in which both g_t and g_c conversions have already been included + if not args.use_g_t and not args.use_g_c and not args.considernonG: + print('ERROR: we must either count G to T or G to C mutations (or both) or consider nonG conversions.') + sys.exit() + + #What metric should we care about? 
+ if args.considernonG: + metric = 'porc' + else: + metric = 'convGrate' #what goes into this rate is set in the corresponding pigpen run + + #Make df of PORC values + porcdf = makePORCdf(args.sampconds, args.minreads, args.considernonG) + #Add deltaporc values + porcdf = calcDeltaPORC(porcdf, args.sampconds, args.conditionA, args.conditionB, metric) + filteredgenes = porcdf['GeneID'].tolist() + #Get p values and corrected p values + pdf = getpvalues(args.sampconds, args.conditionA, args.conditionB, args.considernonG, filteredgenes, args.use_g_t, args.use_g_c, args.use_g_x, args.use_ng_xg) + #add p values and FDR + porcdf = pd.merge(porcdf, pdf, on = ['GeneID']) + #Format floats + porcdf = formatporcdf(porcdf) + + porcdf.to_csv(args.output, sep = '\t', index = False) + +if __name__ == '__main__': + main() + diff --git a/src/pigpen/bacon_subsample.py b/src/pigpen/bacon_subsample.py new file mode 100644 index 0000000..6f5a7bb --- /dev/null +++ b/src/pigpen/bacon_subsample.py @@ -0,0 +1,141 @@ +#As a statistical framework for identifying genes with differing 8OG-mediated conversion rates +#across conditions, use a subsampling approach. For each gene, subsample the reads assigned to it, +#calculating a porc value for each subsample. You then end up with a distribution of porc +#values for each gene in each sample. + +#This relies on two pickled dictionaries produced by pigpen.py: read2gene.pkl and readconvs.pkl. +#The first is a dictionary of the form {readid : ensembl_gene_id}. +#The second is a dictionary of the form {readid : {convs}} +#where {convs} is of the form {x_y : count} where x is the reference sequence and y is the query sequence. +#x is one of 'agct' and y is one of 'agctn'. + +#Then, using Hotelling's t-test, compare distributions of porc values across conditions.
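The subsampling scheme sketched in the comments above, in miniature (hypothetical per-read G counts; the statistic here is a simple pooled conversion rate rather than the full porc calculation, and all names are illustrative):

```python
import random

random.seed(0)  # reproducible toy data

# hypothetical per-read counts: converted Gs and total reference Gs
reads = [{'convG': random.randint(0, 2), 'totalG': 20} for _ in range(200)]

def conv_rate(subset):
    # pooled conversion rate across one subsample of reads
    convG = sum(r['convG'] for r in subset)
    totalG = sum(r['totalG'] for r in subset)
    return convG / totalG

# draw 30% of the gene's reads, 100 times, as in subsamplegeneconv
subsamplesize, n_subsamples = 0.3, 100
n_draw = int(len(reads) * subsamplesize)
rates = [conv_rate(random.sample(reads, n_draw))
         for _ in range(n_subsamples)]
# rates is this gene's distribution of subsampled conversion rates
```

Because `random.sample` draws without replacement, each subsample is a distinct 30% subset of the gene's reads, and the spread of `rates` reflects within-gene sampling variability.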
+ +import pickle +import numpy as np +import random +import sys +from collections import defaultdict + +def getReadsperGene(read2genepkl): + #pigpen produces a dictionary of the form {readid : ensembl_gene_id} (assignreads.processOverlaps) + #It has written this dictionary to a pickled file. + #We need a dictionary of the form {ensembl_gene_id : [readids]} + + print('Loading gene/read assignments...') + with open(read2genepkl, 'rb') as infh: + read2gene = pickle.load(infh) + print('Done!') + + readspergene = {} #{ensembl_gene_id : [readids that belong to this gene]} + + readcount = 0 + for read in read2gene: + gene = read2gene[read] + if gene not in readspergene: + readspergene[gene] = [read] + else: + readspergene[gene].append(read) + + return readspergene + +def makeGeneConvdict(readspergene, readconvspkl): + #Want to make a dictionary that looks like this + #{gene : [{convs1}, {convs2}, ...]} where each conv dict corresponds to one read + #that has been assigned to this gene + + print('Loading read conversion info...') + with open(readconvspkl, 'rb') as infh: + readconvs = pickle.load(infh) + print('Done!') + + print('Making gene : read conversion dictionary...') + geneconv = defaultdict(list) + for gene in readspergene: + reads = readspergene[gene] + for read in reads: + try: + geneconv[gene].append(readconvs[read]) + except KeyError: + pass #a read for which we calculated conversions but didn't get assigned to any read + print('Done!') + + return geneconv + +def calcPORC(convs): + #accepting a list of conversion dictionaries + + convGcount = 0 + totalGcount = 0 + allconvcount = 0 + allnonconvcount = 0 + + for conv in convs: + convG = conv['g_t'] + conv['g_c'] + totalG = conv['g_t'] + conv['g_c'] + \ + conv['g_a'] + conv['g_n'] + conv['g_g'] + + allconv = conv['a_t'] + conv['a_c'] + conv['a_g'] + conv['g_t'] + conv['g_c'] + \ + conv['g_a'] + conv['t_a'] + conv['t_c'] + \ + conv['t_g'] + conv['c_t'] + conv['c_g'] + conv['c_a'] + \ + conv['a_n'] + conv['g_n'] + 
conv['c_n'] + conv['t_n'] + + allnonconv = conv['a_a'] + conv['g_g'] + conv['c_c'] + conv['t_t'] + + convGcount += convG + totalGcount += totalG + allconvcount += allconv + allnonconvcount += allnonconv + + allnt = allconvcount + allnonconvcount + try: + convGrate = convGcount / totalGcount + except ZeroDivisionError: + convGrate = np.nan + + try: + totalmutrate = allconvcount / allnt + except ZeroDivisionError: + totalmutrate = np.nan + + #Calculate porc + if np.isnan(totalmutrate): + porc = np.nan + elif totalmutrate > 0: + try: + porc = np.log2(convGrate / totalmutrate) + except: + porc = np.nan + else: + porc = np.nan + + return porc + +def subsamplegeneconv(geneconv, subsamplesize, n_subsamples): + subsampledporcs = defaultdict(list) # {ensembl_gene_id : [porc values]} + + for gene in geneconv: + print(gene) + convs = geneconv[gene] + nreads = len(convs) + n_readstosubsample = int(nreads * subsamplesize) + for i in range(n_subsamples): + subsampledconvs = random.sample(convs, n_readstosubsample) + porc = calcPORC(subsampledconvs) + print(porc) + subsampledporcs[gene].append(porc) + + return subsampledporcs + +#Take in a sampconds, calculate subsamples for each, end up with a dictionary like so: +#{gene : {condition : [[subsampled porcs sample 1], [subsampled porcs sample 2], ...]}} + + + + +if __name__ == '__main__': + + readspergene = getReadsperGene(sys.argv[1]) + geneconv = makeGeneConvdict(readspergene, sys.argv[2]) + subsamplegeneconv(geneconv, 0.3, 100) + diff --git a/conversionsPerGene.py b/src/pigpen/conversionsPerGene.py similarity index 75% rename from conversionsPerGene.py rename to src/pigpen/conversionsPerGene.py index e914b59..52627d1 100644 --- a/conversionsPerGene.py +++ b/src/pigpen/conversionsPerGene.py @@ -24,7 +24,8 @@ def getPerGene(convs, reads2gene): 'a_a', 'a_t', 'a_c', 'a_g', 'a_n', 'g_a', 'g_t', 'g_c', 'g_g', 'g_n', 'c_a', 'c_t', 'c_c', 'c_g', 'c_n', - 't_a', 't_t', 't_c', 't_g', 't_n'] + 't_a', 't_t', 't_c', 't_g', 't_n', + 'a_x',
'g_x', 'c_x', 't_x', 'ng_xg'] #It's possible (but relatively rare) for a read to be in convs but #not in reads2gene (or vice versa). Filter for reads only present in both. @@ -36,7 +37,6 @@ def getPerGene(convs, reads2gene): convs = {key:value for (key, value) in convs.items() if key in commonreads} reads2gene = {key:value for (key, value) in reads2gene.items() if key in commonreads} - for gene in geneids: convsPerGene[gene] = {} for conv in possibleconvs: @@ -51,17 +51,21 @@ def getPerGene(convs, reads2gene): return numreadspergene, convsPerGene -def writeConvsPerGene(numreadspergene, convsPerGene, outfile): +def writeConvsPerGene(sampleparams, numreadspergene, convsPerGene, outfile, use_g_t, use_g_c, use_g_x, use_ng_xg): possibleconvs = [ 'a_a', 'a_t', 'a_c', 'a_g', 'a_n', 'g_a', 'g_t', 'g_c', 'g_g', 'g_n', 'c_a', 'c_t', 'c_c', 'c_g', 'c_n', - 't_a', 't_t', 't_c', 't_g', 't_n'] + 't_a', 't_t', 't_c', 't_g', 't_n', + 'a_x', 'g_x', 'c_x', 't_x', 'ng_xg'] with open(outfile, 'w') as outfh: + #Write arguments for this pigpen run + for arg in sampleparams: + outfh.write('#' + arg + '\t' + str(sampleparams[arg]) + '\n') #total G is number of ref Gs encountered - #convG is g_t + g_c (the ones we are interested in) - outfh.write(('\t').join(['Gene', 'numreads'] + possibleconvs + ['totalG', 'convG', 'convGrate', 'G_Trate', 'G_Crate', 'porc']) + '\n') + #convG is g_t + g_c + g_x + ng_xg (the ones we are interested in) + outfh.write(('\t').join(['Gene', 'numreads'] + possibleconvs + ['totalG', 'convG', 'convGrate', 'G_Trate', 'G_Crate', 'G_Xrate', 'NG_XGrate', 'porc']) + '\n') genes = sorted(convsPerGene.keys()) for gene in genes: @@ -74,12 +78,19 @@ def writeConvsPerGene(numreadspergene, convsPerGene, outfile): convcounts = [str(x) for x in convcounts] - totalG = c['g_g'] + c['g_c'] + c['g_t'] + c['g_a'] + c['g_n'] - convG = c['g_c'] + c['g_t'] + totalG = c['g_g'] + c['g_c'] + c['g_t'] + c['g_a'] + c['g_n'] + c['g_x'] + convG = 0 + possiblegconv = ['g_t', 'g_c', 'g_x', 
'ng_xg'] + for ind, x in enumerate([use_g_t, use_g_c, use_g_x, use_ng_xg]): + if x == True: + convG += c[possiblegconv[ind]] + g_ccount = c['g_c'] g_tcount = c['g_t'] + g_xcount = c['g_x'] + ng_xgcount = c['ng_xg'] - totalmut = c['a_t'] + c['a_c'] + c['a_g'] + c['g_t'] + c['g_c'] + c['g_a'] + c['t_a'] + c['t_c'] + c['t_g'] + c['c_t'] + c['c_g'] + c['c_a'] + totalmut = c['a_t'] + c['a_c'] + c['a_g'] + c['g_t'] + c['g_c'] + c['g_a'] + c['t_a'] + c['t_c'] + c['t_g'] + c['c_t'] + c['c_g'] + c['c_a'] + c['g_x'] + c['ng_xg'] totalnonmut = c['a_a'] + c['g_g'] + c['c_c'] + c['t_t'] allnt = totalmut + totalnonmut @@ -98,6 +109,16 @@ def writeConvsPerGene(numreadspergene, convsPerGene, outfile): except ZeroDivisionError: g_trate = 'NA' + try: + g_xrate = g_xcount / totalG + except ZeroDivisionError: + g_xrate = 'NA' + + try: + ng_xgrate = ng_xgcount / totalG + except ZeroDivisionError: + ng_xgrate = 'NA' + try: totalmutrate = totalmut / allnt except ZeroDivisionError: @@ -122,10 +143,14 @@ def writeConvsPerGene(numreadspergene, convsPerGene, outfile): g_trate = '{:.2e}'.format(g_trate) if type(g_crate) == float: g_crate = '{:.2e}'.format(g_crate) + if type(g_xrate) == float: + g_xrate = '{:.2e}'.format(g_xrate) + if type(ng_xgrate) == float: + ng_xgrate = '{:.2e}'.format(ng_xgrate) if type(porc) == np.float64: porc = '{:.3f}'.format(porc) - outfh.write(('\t').join([gene, str(numreads)] + convcounts + [str(totalG), str(convG), str(convGrate), str(g_trate), str(g_crate), str(porc)]) + '\n') + outfh.write(('\t').join([gene, str(numreads)] + convcounts + [str(totalG), str(convG), str(convGrate), str(g_trate), str(g_crate), str(g_xrate), str(ng_xgrate), str(porc)]) + '\n') if __name__ == '__main__': diff --git a/filterbam.py b/src/pigpen/filterbam.py similarity index 100% rename from filterbam.py rename to src/pigpen/filterbam.py diff --git a/src/pigpen/getmismatches.py b/src/pigpen/getmismatches.py new file mode 100644 index 0000000..23abdb7 --- /dev/null +++ 
b/src/pigpen/getmismatches.py @@ -0,0 +1,831 @@ +import pysam +import os +import re +import sys +from collections import defaultdict +import numpy as np +import pandas as pd +from pigpen.snps import getSNPs, recordSNPs +import pickle +import multiprocessing as mp +import subprocess + + +def revcomp(nt): + revcompdict = { + 'G' : 'C', + 'C' : 'G', + 'A' : 'T', + 'T' : 'A', + 'N' : 'N', + 'g' : 'c', + 'c' : 'g', + 'a' : 't', + 't' : 'a', + 'n' : 'n', + 'X': 'X', + None: None, + 'NA': 'NA' + } + + nt_rc = revcompdict[nt] + + return nt_rc + + +def iteratereads_singleend(bam, use_g_t, use_g_c, use_g_x, use_ng_xg, nConv, minMappingQual, minPhred=30, snps = None, maskpositions = None, verbosity = 'high'): + #Read through a bam containing single end reads (or if it contains paired end reads, just use read 1) + #Find nt conversion locations for each read. + #Store the number of each conversion for each read in a dictionary. + + #Bam must contain MD tag. + + readcounter = 0 + convs = {} #{readid : dictionary of all conversions} + save = pysam.set_verbosity(0) + with pysam.AlignmentFile(bam, 'r') as infh: + if verbosity == 'high': + print('Finding nucleotide conversions in {0}...'.format(os.path.basename(bam))) + for read in infh.fetch(until_eof = True): + if read.is_read2 or read.is_secondary or read.is_supplementary or read.is_unmapped or read.mapping_quality < minMappingQual: + continue + readcounter +=1 + if readcounter % 10000000 == 0: + if verbosity == 'high': + print('Finding nucleotide conversions in read {0}...'.format(readcounter)) + + queryname = read.query_name + queryseq = read.query_sequence #this is always on the + strand, no matter what strand the read maps to + chrm = read.reference_name + qualities = list(read.query_qualities) #phred scores + + #Get a set of snp locations if we have them + if snps: + if chrm in snps: + snplocations = snps[chrm] #set of coordinates to mask + else: + snplocations = None + else: + snplocations = None + + #Get a set of locations 
to mask if we have them + if maskpositions: + if chrm in maskpositions: + # set of coordinates to manually mask + masklocations = maskpositions[chrm] + else: + masklocations = None + else: + masklocations = None + + #combine snps and manually masked positions into one set + #this combined set will be masklocations + if snplocations and masklocations: + masklocations.update(snplocations) + elif snplocations and not masklocations: + masklocations = snplocations + elif masklocations and not snplocations: + masklocations = masklocations + + if read.is_reverse: + strand = '-' + elif not read.is_reverse: + strand = '+' + + alignedpairs = read.get_aligned_pairs(with_seq = True) + readqualities = list(read.query_qualities) + convs_in_read = getmismatches_singleend(alignedpairs, queryseq, readqualities, strand, masklocations, nConv, minPhred, use_g_t, use_g_c, use_g_x, use_ng_xg) + + convs[queryname] = convs_in_read + + if verbosity == 'high': + print('Queried {0} read pairs in {1}.'.format(readcounter, os.path.basename(bam))) + + #Pickle and write convs? + return convs, readcounter + + +def getmismatches_singleend(alignedpairs, queryseq, readqualities, strand, masklocations, nConv, minPhred, use_g_t, use_g_c, use_g_x, use_ng_xg): + #remove tuples that have None + #These are either intronic or might have been soft-clipped + #Tuples are (querypos, (0-based) refpos, refsequence) + #If there is a substitution, refsequence is lower case + + #refnt and querynt, as supplied by pysam, are always + strand + #everything here is assuming read is on sense strand + + #masklocations is a set of chrm_coord locations. At these locations, all queries will be treated as not having a conversion + + #For now, forget insertions. 
Get rid of any position where reference position is None + alignedpairs = [x for x in alignedpairs if x[1] != None] + alignedpairs = [x for x in alignedpairs if x[2] != None] + + #Add quality scores to alignedpairs tuples + #will now be (querypos, refpos, refseqeunce, querysequence, qualscore) + for x in range(len(alignedpairs)): + alignedpair = alignedpairs[x] + querypos = alignedpair[0] + if querypos != None: + querynt = queryseq[querypos] + qualscore = readqualities[querypos] + elif querypos == None: + querynt = 'X' #there is no query nt for a deletion + qualscore = 37 #there's no query position here, so make up a quality score + alignedpair = alignedpair + (querynt, qualscore) + alignedpairs[x] = alignedpair + + #if we have locations to mask, remove their locations from alignedpairs + #masklocations is a set of 0-based coordinates of snp locations to mask + if masklocations: + alignedpairs = [x for x in alignedpairs if x[1] not in masklocations] + + convs = {} #counts of conversions x_y where x is reference sequence and y is query sequence + + possibleconvs = [ + 'a_a', 'a_t', 'a_c', 'a_g', 'a_n', + 'g_a', 'g_t', 'g_c', 'g_g', 'g_n', + 'c_a', 'c_t', 'c_c', 'c_g', 'c_n', + 't_a', 't_t', 't_c', 't_g', 't_n', + 'a_x', 'g_x', 'c_x', 't_x', 'ng_xg'] + + #initialize dictionary + for conv in possibleconvs: + convs[conv] = 0 + + #Reorganize alignedpairs (which is currently list of tuples) into a dict + #alignedpair = (querypos, refpos, refsequence, querysequence, qualscore) + r1dict = {} #{reference position: [queryposition, reference sequence, querysequence, quality]} + for x in alignedpairs: + r1dict[int(x[1])] = [x[0], x[2].upper(), x[3].upper(), x[4]] + + #Now go through r1dict, looking for conversions. 
+ #We are now keeping track of deletions as either g_x (reference G, query deletion) or ng_xg (ref nt 5' of G deleted in query) + #We have observed that sometimes RT skips the nucleotide *after* an oxidized G (after being from the RT's point of view) + + for refpos in r1dict: + conv = None + conv2 = None #sometimes we can have 2 convs (for example the first nt of ng_xg could be both g_x and ng_xg) + querypos = r1dict[refpos][0] + refseq = r1dict[refpos][1] + queryseq = r1dict[refpos][2] + qualscore = r1dict[refpos][3] + + if strand == '-': + #refseq needs to equal the sense strand (it is always initially defined as the + strand). read1 is always the sense strand. + refseq = revcomp(refseq) + + #If reference is N, skip this position + if refseq == 'N' or refseq == 'n': + continue + conv = refseq.lower() + '_' + queryseq.lower() + + if queryseq == 'X': + #Check if there is a reference G downstream of this position + if strand == '+': + downstreamrefpos = refpos + 1 + #It's possible that downstreamrefpos is not in mergedalignedpairs because this position is at the end of the read + try: + downstreamrefseq = r1dict[downstreamrefpos][1].upper() + downstreamqueryseq = r1dict[downstreamrefpos][2].upper() + except KeyError: + downstreamrefseq, downstreamqueryseq = None, None + elif strand == '-': + downstreamrefpos = refpos - 1 + try: + downstreamrefseq = r1dict[downstreamrefpos][1].upper() + downstreamqueryseq = r1dict[downstreamrefpos][2].upper() + except KeyError: + downstreamrefseq, downstreamqueryseq = None, None + downstreamrefseq = revcomp(downstreamrefseq) + downstreamqueryseq = revcomp(downstreamqueryseq) + if downstreamrefseq == 'G': + conv2 = 'ng_xg' + #If this is a non-g deletion and is downstream of a g, we can't be sure if this deletion is due to this nucleotide or the downstream g + if conv in ['a_x', 't_x', 'c_x']: + conv = None + + #Add conv(s) to dictionary + #Only do conv2 (ng_xg) if conv in not g_x + if qualscore >= minPhred: + if conv in convs: + 
convs[conv] +=1 + if conv2 == 'ng_xg' and conv != 'g_x': + convs[conv2] +=1 + + #Does the number of relevant conversions meet our threshold? + allconvs = ['g_c', 'g_t', 'g_x', 'ng_xg'] + convoptions = [use_g_c, use_g_t, use_g_x, use_ng_xg] + selectedconvs = [] + for ind, x in enumerate(allconvs): + if convoptions[ind] == True: + selectedconvs.append(x) + if not selectedconvs: + print('ERROR: we must be looking for at least one conversion type.') + sys.exit() + + nConv_in_read = 0 + for x in selectedconvs: + nConv_in_read += convs[x] + if nConv_in_read < nConv: + for x in allconvs: + convs[x] = 0 + + return convs + +def read_pair_generator(bam, region_string=None): + """ + Generate read pairs in a BAM file or within a region string. + Reads are added to read_dict until a pair is found. + https://www.biostars.org/p/306041/ + """ + read_dict = defaultdict(lambda: [None, None]) + for read in bam: + if not read.is_proper_pair or read.is_secondary or read.is_supplementary or read.mate_is_unmapped or read.is_unmapped: + continue + qname = read.query_name + if qname not in read_dict: + if read.is_read1: + read_dict[qname][0] = read + else: + read_dict[qname][1] = read + else: + if read.is_read1: + yield read, read_dict[qname][1] + else: + yield read_dict[qname][0], read + del read_dict[qname] + +def findsnps(controlbams, genomefasta, minCoverage = 20, minVarFreq = 0.02): + controlbams = controlbams.split(',') + getSNPs(controlbams, genomefasta, minCoverage, minVarFreq) + snps = recordSNPs('merged.vcf') + + return snps + + +def iteratereads_pairedend(bam, onlyConsiderOverlap, use_g_t, use_g_c, use_g_x, use_ng_xg, use_read1, use_read2, nConv, minMappingQual, minPhred=30, snps=None, maskpositions=None, verbosity='high'): + #Iterate over reads in a paired end alignment file. + #Find nt conversion locations for each read. 
+ #For locations interrogated by both mates of read pair, conversion must exist in both mates in order to count + #Store the number of conversions for each read in a dictionary + + #Quality score array is always in the same order as query_sequence, which is always on the + strand + #Bam must contain MD tags + + if onlyConsiderOverlap == 'True': + onlyConsiderOverlap = True + elif onlyConsiderOverlap == 'False': + onlyConsiderOverlap = False + + queriednts = [] + readcounter = 0 + convs = {} #{readid : dictionary of all conversions} + save = pysam.set_verbosity(0) + with pysam.AlignmentFile(bam, 'r') as infh: + if verbosity == 'high': + print('Finding nucleotide conversions in {0}...'.format(os.path.basename(bam))) + for read1, read2 in read_pair_generator(infh): + + #Just double check that the pairs are matched + if read1.query_name != read2.query_name: + continue + + #Check mapping quality + #MapQ is 255 for uniquely aligned reads FOR STAR ONLY + if read1.mapping_quality < int(minMappingQual) or read2.mapping_quality < int(minMappingQual): + continue + + readcounter +=1 + if readcounter % 1000000 == 0: + if verbosity == 'high': + print('Finding nucleotide conversions in read {0}...'.format(readcounter)) + + queryname = read1.query_name + chrm = read1.reference_name + + #Get a set of positions to mask (snps + locations we want to mask) + #Get a set of snp locations if we have them + if snps: + if chrm in snps: + snplocations = snps[chrm] #set of snp coordinates to mask + else: + snplocations = None + else: + snplocations = None + + #Get a set of locations to mask if we have them + if maskpositions: + if chrm in maskpositions: + masklocations = maskpositions[chrm] #set of coordinates to manually mask + else: + masklocations = None + else: + masklocations = None + + #combine snps and manually masked positions into one set + #this combined set will be masklocations + if snplocations and masklocations: + masklocations.update(snplocations) + elif snplocations and not 
masklocations: + masklocations = snplocations + elif masklocations and not snplocations: + masklocations = masklocations + + read1queryseq = read1.query_sequence + read1alignedpairs = read1.get_aligned_pairs(with_seq = True) + if read1.is_reverse: + read1strand = '-' + elif not read1.is_reverse: + read1strand = '+' + + read2queryseq = read2.query_sequence + read2alignedpairs = read2.get_aligned_pairs(with_seq = True) + if read2.is_reverse: + read2strand = '-' + elif not read2.is_reverse: + read2strand = '+' + + read1qualities = list(read1.query_qualities) #phred scores + read2qualities = list(read2.query_qualities) + + convs_in_read = getmismatches_pairedend(read1alignedpairs, read2alignedpairs, read1queryseq, read2queryseq, read1qualities, read2qualities, read1strand, read2strand, masklocations, onlyConsiderOverlap, nConv, minPhred, use_g_t, use_g_c, use_g_x, use_ng_xg, use_read1, use_read2) + queriednts.append(sum(convs_in_read.values())) + convs[queryname] = convs_in_read + + if verbosity == 'high': + print('Queried {0} read pairs in {1}.'.format(readcounter, os.path.basename(bam))) + + pysam.set_verbosity(save) + #Pickle and write convs + #with open('convs.pkl', 'wb') as outfh: + #pickle.dump(convs, outfh) + + return convs, readcounter + + +def getmismatches_pairedend(read1alignedpairs, read2alignedpairs, read1queryseq, read2queryseq, read1qualities, read2qualities, read1strand, read2strand, masklocations, onlyoverlap, nConv, minPhred, use_g_t, use_g_c, use_g_x, use_ng_xg, use_read1, use_read2): + #remove tuples that have None + #These are either intronic or might have been soft-clipped + #Tuples are (querypos, refpos, refsequence) + #If there is a substitution, refsequence is lower case + + #For now, forget insertions. Get rid of any position where reference position is None. 
+ read1alignedpairs = [x for x in read1alignedpairs if x[1] != None] + read2alignedpairs = [x for x in read2alignedpairs if x[1] != None] + read1alignedpairs = [x for x in read1alignedpairs if x[2] != None] + read2alignedpairs = [x for x in read2alignedpairs if x[2] != None] + + #Add quality scores and query sequences + #will now be (querypos, refpos, refsequence, querysequence, qualscore) + for x in range(len(read1alignedpairs)): + alignedpair = read1alignedpairs[x] + querypos = alignedpair[0] + if querypos != None: + querynt = read1queryseq[querypos] + qualscore = read1qualities[querypos] + elif querypos == None: + querynt = 'X' # there is no query nt for a deletion + qualscore = 37 # there's no query position here, so make up a quality score + alignedpair = alignedpair + (querynt, qualscore) + read1alignedpairs[x] = alignedpair + + for x in range(len(read2alignedpairs)): + alignedpair = read2alignedpairs[x] + querypos = alignedpair[0] + if querypos != None: + querynt = read2queryseq[querypos] + qualscore = read2qualities[querypos] + elif querypos == None: + querynt = 'X' + qualscore = 37 + alignedpair = alignedpair + (querynt, qualscore) + read2alignedpairs[x] = alignedpair + + #if we have locations to mask, remove their locations from read1alignedpairs and read2alignedpairs + #masklocations is a set of 0-based coordinates of snp locations to mask + if masklocations: + read1alignedpairs = [x for x in read1alignedpairs if x[1] not in masklocations] + read2alignedpairs = [x for x in read2alignedpairs if x[1] not in masklocations] + + convs = {} #counts of conversions x_y where x is reference sequence and y is query sequence + + possibleconvs = [ + 'a_a', 'a_t', 'a_c', 'a_g', 'a_n', + 'g_a', 'g_t', 'g_c', 'g_g', 'g_n', + 'c_a', 'c_t', 'c_c', 'c_g', 'c_n', + 't_a', 't_t', 't_c', 't_g', 't_n', + 'a_x', 'g_x', 'c_x', 't_x', 'ng_xg'] + + #initialize dictionary + for conv in possibleconvs: + convs[conv] = 0 + + #For locations interrogated by both mates of read pair, 
conversion must exist in both mates in order to count + #These locations (as defined by their reference positions) would be found both in read1alignedpairs and read2alignedpairs + + #Get the ref positions queried by the two reads + r1dict = {} #{reference position: [queryposition, reference sequence, querysequence, quality]} + r2dict = {} + for x in read1alignedpairs: + r1dict[int(x[1])] = [x[0], x[2].upper(), x[3].upper(), x[4]] + for x in read2alignedpairs: + r2dict[int(x[1])] = [x[0], x[2].upper(), x[3].upper(), x[4]] + + # {refpos : [R1querypos, R2querypos, R1refsequence, R2refsequence, R1querysequence, R2querysequence, R1quality, R2quality]} + mergedalignedpairs = {} + #For positions only in R1 or R2, querypos and refsequence are NA for the other read + for refpos in r1dict: + r1querypos = r1dict[refpos][0] + r1refseq = r1dict[refpos][1] + r1queryseq = r1dict[refpos][2] + r1quality = r1dict[refpos][3] + if refpos in mergedalignedpairs: #this should not be possible because we are looking at r1 first + r2querypos = mergedalignedpairs[refpos][1] + r2refseq = mergedalignedpairs[refpos][3] + r2queryseq = mergedalignedpairs[refpos][5] + r2quality = mergedalignedpairs[refpos][7] + mergedalignedpairs[refpos] = [r1querypos, r2querypos, r1refseq, + r2refseq, r1queryseq, r2queryseq, r1quality, r2quality] + else: + mergedalignedpairs[refpos] = [r1querypos, 'NA', r1refseq, 'NA', r1queryseq, 'NA', r1quality, 'NA'] + + for refpos in r2dict: + #same thing + r2querypos = r2dict[refpos][0] + r2refseq = r2dict[refpos][1] + r2queryseq = r2dict[refpos][2] + r2quality = r2dict[refpos][3] + if refpos in mergedalignedpairs: #if we saw it for r1 + r1querypos = mergedalignedpairs[refpos][0] + r1refseq = mergedalignedpairs[refpos][2] + r1queryseq = mergedalignedpairs[refpos][4] + r1quality = mergedalignedpairs[refpos][6] + mergedalignedpairs[refpos] = [r1querypos, r2querypos, r1refseq, + r2refseq, r1queryseq, r2queryseq, r1quality, r2quality] + else: + mergedalignedpairs[refpos] = 
['NA', r2querypos, 'NA', r2refseq, 'NA', r2queryseq, 'NA', r2quality] + + #If we are only using read1 or only using read2, replace the positions in the non-used read with NA + for refpos in mergedalignedpairs: + r1querypos, r2querypos, r1refseq, r2refseq, r1queryseq, r2queryseq, r1quality, r2quality = mergedalignedpairs[refpos] + if use_read1 and not use_read2: + updatedlist = [r1querypos, 'NA', r1refseq, + 'NA', r1queryseq, 'NA', r1quality, 'NA'] + mergedalignedpairs[refpos] = updatedlist + elif use_read2 and not use_read1: + updatedlist = ['NA', r2querypos, 'NA', + r2refseq, 'NA', r2queryseq, 'NA', r2quality] + mergedalignedpairs[refpos] = updatedlist + elif not use_read1 and not use_read2: + print('ERROR: we have to use either read1 or read2, if not both.') + sys.exit() + elif use_read1 and use_read2: + pass + + #Now go through mergedalignedpairs, looking for conversions. + #For positions observed both in r1 and r2, queryseq in both reads must match, otherwise the position is skipped. + #We are now keeping track of deletions as either g_x (reference G, query deletion) or ng_xg (ref nt 5' of G deleted in query) + #We have observed that sometimes RT skips the nucleotide *after* an oxidized G (after being from the RT's point of view) + + for refpos in mergedalignedpairs: + conv = None + conv2 = None #sometimes we can have 2 convs (for example the first nt of ng_xg could be both g_x and ng_xg) + r1querypos = mergedalignedpairs[refpos][0] + r2querypos = mergedalignedpairs[refpos][1] + r1refseq = mergedalignedpairs[refpos][2] + r2refseq = mergedalignedpairs[refpos][3] + r1queryseq = mergedalignedpairs[refpos][4] + r2queryseq = mergedalignedpairs[refpos][5] + r1quality = mergedalignedpairs[refpos][6] + r2quality = mergedalignedpairs[refpos][7] + + if r1querypos != 'NA' and r2querypos == 'NA': #this position queried by r1 only + if read1strand == '-': + #refseq needs to equal the sense strand (it is always initially defined as the + strand). 
read1 is always the sense strand. + r1refseq = revcomp(r1refseq) + r1queryseq = revcomp(r1queryseq) + + #If reference is N, skip this position + if r1refseq == 'N' or r1refseq == 'n': + continue + conv = r1refseq.lower() + '_' + r1queryseq.lower() + + if r1queryseq == 'X': + #Check if there is a reference G downstream of this position + if read1strand == '+': + downstreamrefpos = refpos + 1 + #It's possible that downstreamrefpos is not in mergedalignedpairs because this position is at the end of the read + try: + downstreamrefseq = mergedalignedpairs[downstreamrefpos][2].upper() + downstreamqueryseq = mergedalignedpairs[downstreamrefpos][4].upper() + except KeyError: + downstreamrefseq, downstreamqueryseq = None, None + elif read1strand == '-': + downstreamrefpos = refpos - 1 + try: + downstreamrefseq = mergedalignedpairs[downstreamrefpos][2].upper() + downstreamqueryseq = mergedalignedpairs[downstreamrefpos][4].upper() + except KeyError: + downstreamrefseq, downstreamqueryseq = None, None + downstreamrefseq = revcomp(downstreamrefseq) + downstreamqueryseq = revcomp(downstreamqueryseq) + if downstreamrefseq == 'G': + conv2 = 'ng_xg' + #If this is a non-g deletion and is downstream of a g, we can't be sure if this deletion is due to this nucleotide or the downstream g + if conv in ['a_x', 't_x', 'c_x']: + conv = None + + #Add conv(s) to dictionary + if r1quality >= minPhred and onlyoverlap == False: + # there will be some conversions (e.g. a_x that are not in convs) + if conv in convs: + convs[conv] +=1 + if conv2 == 'ng_xg' and conv != 'g_x': + convs[conv2] +=1 + + elif r1querypos == 'NA' and r2querypos != 'NA': #this position is queried by r2 only + if read1strand == '-': + # reference seq is independent of which read we are talking about + #Read1 is always the sense strand. r1queryseq and r2queryseq are always + strand + #The reference sequence is always on the + strand. 
+ #If read1 is on the - strand, we have already flipped reference seq (see a few lines above). + #We need to flip read2queryseq so that it is also - strand. + r2refseq = revcomp(r2refseq) + r2queryseq = revcomp(r2queryseq) + + if r2refseq == 'N' or r2refseq == 'n': + continue + + conv = r2refseq.lower() + '_' + r2queryseq.lower() + if r2queryseq == 'X': + #Check if there is a reference G downstream of this position + if read1strand == '+': + downstreamrefpos = refpos + 1 + try: + downstreamrefseq = mergedalignedpairs[downstreamrefpos][2].upper() + downstreamqueryseq = mergedalignedpairs[downstreamrefpos][4].upper() + except KeyError: + downstreamrefseq, downstreamqueryseq = None, None + elif read1strand == '-': + downstreamrefpos = refpos - 1 + try: + downstreamrefseq = mergedalignedpairs[downstreamrefpos][2].upper() + downstreamqueryseq = mergedalignedpairs[downstreamrefpos][4].upper() + except KeyError: + downstreamrefseq, downstreamqueryseq = None, None + downstreamrefseq = revcomp(downstreamrefseq) + downstreamqueryseq = revcomp(downstreamqueryseq) + if downstreamrefseq == 'G': + conv2 = 'ng_xg' + if conv in ['a_x', 't_x', 'c_x']: + conv = None + + #Add conv(s) to dictionary + if r2quality >= minPhred and onlyoverlap == False: + if conv in convs: + convs[conv] +=1 + if conv2 == 'ng_xg' and conv != 'g_x': + convs[conv2] +=1 + + elif r1querypos != 'NA' and r2querypos != 'NA': #this position is queried by both reads + if read1strand == '-': + r1refseq = revcomp(r1refseq) + r2refseq = revcomp(r2refseq) + r1queryseq = revcomp(r1queryseq) + r2queryseq = revcomp(r2queryseq) + + if r1refseq == 'N' or r2refseq == 'N' or r1refseq == 'n' or r2refseq == 'n': + continue + + r1result = r1refseq.lower() + '_' + r1queryseq.lower() + r2result = r2refseq.lower() + '_' + r2queryseq.lower() + + #Only record if r1 and r2 agree about what is going on + if r1result == r2result: + conv = r1refseq.lower() + '_' + r1queryseq.lower() + if r1queryseq == 'X': + #Check if there is a reference
G downstream of this position + if read1strand == '+': + downstreamrefpos = refpos + 1 + try: + downstreamrefseq = mergedalignedpairs[downstreamrefpos][2].upper() + downstreamqueryseq = mergedalignedpairs[downstreamrefpos][4].upper() + except KeyError: + downstreamrefseq, downstreamqueryseq = None, None + elif read1strand == '-': + downstreamrefpos = refpos - 1 + try: + downstreamrefseq = mergedalignedpairs[downstreamrefpos][2].upper() + downstreamqueryseq = mergedalignedpairs[downstreamrefpos][4].upper() + except KeyError: + downstreamrefseq, downstreamqueryseq = None, None + downstreamrefseq = revcomp(downstreamrefseq) + downstreamqueryseq = revcomp(downstreamqueryseq) + if downstreamrefseq == 'G': + conv2 = 'ng_xg' + if conv in ['a_x', 't_x', 'c_x']: + conv = None + + #Add conv(s) to dictionary + #Only do conv2 (ng_xg) if conv is not g_x + if r1quality >= minPhred and r2quality >= minPhred: + if conv in convs: + convs[conv] +=1 + if conv2 == 'ng_xg' and conv != 'g_x': + convs[conv2] +=1 + + elif r1querypos == 'NA' and r2querypos == 'NA': #if we are using only read1 or read2, it's possible for this position in both reads to be NA + continue + + #Does the number of g_t and/or g_c conversions meet our threshold? 
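The per-read thresholding that follows can be sketched in isolation: if the conversion types the user selected do not together reach `nConv`, every tracked conversion count for that read is zeroed. `threshold_convs` and `ALLCONVS` below are illustrative names for this editor's sketch, not functions from the source.

```python
#Editor's sketch (hypothetical helper, not in the source) of PIGPEN's
#per-read conversion thresholding.
ALLCONVS = ['g_c', 'g_t', 'g_x', 'ng_xg']

def threshold_convs(convs, selectedconvs, nConv):
    #Sum only the conversion types the user asked for
    nConv_in_read = sum(convs[c] for c in selectedconvs)
    if nConv_in_read < nConv:
        #Below threshold: zero out ALL tracked conversions, not just
        #the selected ones, mirroring the loop over allconvs below.
        for c in ALLCONVS:
            convs[c] = 0
    return convs

#One g_t plus one g_x, but only g_t/g_c are selected and nConv is 2,
#so every count (including g_x) is zeroed.
read = {'g_c': 0, 'g_t': 1, 'g_x': 1, 'ng_xg': 0}
threshold_convs(read, ['g_t', 'g_c'], 2)
```

Note that conversions of non-selected types are discarded along with the selected ones when the threshold is not met, which matches the reset loop over `allconvs` in the source.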
+ allconvs = ['g_c', 'g_t', 'g_x', 'ng_xg'] + convoptions = [use_g_c, use_g_t, use_g_x, use_ng_xg] + selectedconvs = [] + for ind, x in enumerate(allconvs): + if convoptions[ind] == True: + selectedconvs.append(x) + if not selectedconvs: + print('ERROR: we must be looking for at least one conversion type.') + sys.exit() + + nConv_in_read = 0 + for x in selectedconvs: + nConv_in_read += convs[x] + if nConv_in_read < nConv: + for x in allconvs: + convs[x] = 0 + + return convs + + +def summarize_convs(convs, outfile): + #Take in a dictionary of conversions and calculate conversion rates for all conversions + #convs has the form {readid : {a_a : count, a_t : count, etc.}} + + a = 0 #running counter of all *reference* a's (converted and not converted) + a_t = 0 #running counter of all a to t conversions + a_c = 0 + a_g = 0 + a_n = 0 + c = 0 + c_t = 0 + c_g = 0 + c_a = 0 + c_n = 0 + g = 0 + g_t = 0 + g_a = 0 + g_c = 0 + g_n = 0 + t = 0 + t_a = 0 + t_g = 0 + t_c = 0 + t_n = 0 + g_x = 0 + ng_xg = 0 + + for read in convs: + conv_in_read = convs[read] + a += (conv_in_read['a_a'] + conv_in_read['a_t'] + conv_in_read['a_c'] + conv_in_read['a_g'] + conv_in_read['a_n']) + a_t += conv_in_read['a_t'] + a_c += conv_in_read['a_c'] + a_g += conv_in_read['a_g'] + a_n += conv_in_read['a_n'] + + c += (conv_in_read['c_a'] + conv_in_read['c_t'] + conv_in_read['c_c'] + conv_in_read['c_g'] + conv_in_read['c_n']) + c_t += conv_in_read['c_t'] + c_a += conv_in_read['c_a'] + c_g += conv_in_read['c_g'] + c_n += conv_in_read['c_n'] + + g += (conv_in_read['g_a'] + conv_in_read['g_t'] + conv_in_read['g_c'] + conv_in_read['g_g'] + conv_in_read['g_n'] + conv_in_read['g_x']) + g_t += conv_in_read['g_t'] + g_a += conv_in_read['g_a'] + g_c += conv_in_read['g_c'] + g_n += conv_in_read['g_n'] + g_x += conv_in_read['g_x'] + ng_xg += conv_in_read['ng_xg'] + + t += (conv_in_read['t_a'] + conv_in_read['t_t'] + conv_in_read['t_c'] + conv_in_read['t_g'] + conv_in_read['t_n']) + t_g += conv_in_read['t_g'] + t_a 
+= conv_in_read['t_a'] + t_c += conv_in_read['t_c'] + t_n += conv_in_read['t_n'] + + totalnt = a + c + g + t + totalconv = a_t + a_c + a_g + c_t + c_a + c_g + g_t + g_a + g_c + t_g + t_a + t_c + + totalerrorrate = totalconv / totalnt + + with open(outfile, 'w') as outfh: + outfh.write(('\t').join([ + 'Acount', 'A_Tcount', 'A_Ccount', 'A_Gcount', 'A_Ncount', + 'Ccount', 'C_Tcount', 'C_Acount', 'C_Gcount', 'C_Ncount', + 'Gcount', 'G_Tcount', 'G_Ccount', 'G_Acount', 'G_Ncount', + 'Tcount', 'T_Acount', 'T_Ccount', 'T_Gcount', 'T_Ncount', + 'A_Trate', 'A_Crate', 'A_Grate', 'A_Nrate', + 'C_Trate', 'C_Arate', 'C_Grate', 'C_Nrate', + 'G_Trate', 'G_Crate', 'G_Arate', 'G_Nrate', 'G_Xrate', 'NG_XGrate', + 'T_Arate', 'T_Crate', 'T_Grate', 'T_Nrate', 'totalnt', 'totalconv', 'totalerrorrate' + ]) + '\n') + + outfh.write(('\t').join([ + str(a), str(a_t), str(a_c), str(a_g), str(a_n), + str(c), str(c_t), str(c_a), str(c_g), str(c_n), + str(g), str(g_t), str(g_c), str(g_a), str(g_n), + str(t), str(t_a), str(t_c), str(t_g), str(t_n), + str(a_t / a), str(a_c / a), str(a_g / a), str(a_n / a), + str(c_t / c), str(c_a / c), str(c_g / c), str(c_n / c), + str(g_t / g), str(g_c / g), str(g_a / g), str(g_n / g), str(g_x / g), str(ng_xg / g), + str(t_a / t), str(t_c / t), str(t_g / t), str(t_n / t), + str(totalnt), str(totalconv), str(totalerrorrate) + ])) + +def split_bam(bam, nproc): + #Split bam using samtools instead of bamtools (may be faster) + #Check for index + if os.path.exists(os.path.abspath(bam) + '.bai'): + pass + else: + indexCMD = 'samtools index ' + os.path.abspath(bam) + index = subprocess.Popen(indexCMD, shell = True) + index.wait() + + #Get chromosome names + idxstatsCMD = 'samtools idxstats ' + os.path.abspath(bam) + idxstats = subprocess.Popen(idxstatsCMD, shell = True, stdout = subprocess.PIPE) + chrms = [] + for line in idxstats.stdout: + line = line.decode('UTF-8').strip().split('\t') + chrm = line[0] + if chrm != '*': + chrms.append(chrm) + + splitbams = [] + for 
chrm in chrms: + filebasename = chrm + '_SPLIT.bam' + filepath = os.path.join(os.path.dirname(os.path.abspath(bam)), filebasename) + splitbams.append(filepath) + splitCMD = 'samtools view -@ ' + str(nproc) + ' -b ' + os.path.abspath(bam) + ' ' + chrm + ' > ' + filepath + s = subprocess.Popen(splitCMD, shell = True) + s.wait() + + return splitbams + + +def getmismatches(datatype, bam, onlyConsiderOverlap, snps, maskpositions, nConv, minMappingQual, nproc, use_g_t, use_g_c, use_g_x, use_ng_xg, use_read1, use_read2, minPhred=30): + #Actually run the mismatch code (calling iteratereads_pairedend) + #use multiprocessing + #If there's only one processor, easier to use iteratereads_pairedend() directly. + + pool = mp.Pool(processes = int(nproc)) + print('Using {0} processors to identify mismatches in {1}.'.format(nproc, bam)) + splitbams = split_bam(bam, int(nproc)) + argslist = [] + for x in splitbams: + if datatype == 'paired': + argslist.append((x, bool(onlyConsiderOverlap), bool( + use_g_t), bool(use_g_c), bool(use_g_x), bool(use_ng_xg), bool(use_read1), bool(use_read2), nConv, minMappingQual, minPhred, snps, maskpositions, 'low')) + elif datatype == 'single': + argslist.append((x, bool(use_g_t), bool(use_g_c), bool(use_g_x), bool(use_ng_xg), nConv, minMappingQual, minPhred, snps, maskpositions, 'low')) + + #items returned from iteratereads_pairedend are in a list, one per process + totalreadcounter = 0 #number of reads across all the split bams + if datatype == 'paired': + results = pool.starmap(iteratereads_pairedend, argslist) #this actually returns two things, convs and readcounter + elif datatype == 'single': + results = pool.starmap(iteratereads_singleend, argslist) + #results is a list of (convs, readcounter) tuples, one per split bam + convs_split = [] + for result in results: + convs_split.append(result[0]) + for result in results: + totalreadcounter += result[1] + + if datatype == 'paired': + print('Queried {0}
read pairs in {1}.'.format(totalreadcounter, os.path.basename(bam))) + elif datatype == 'single': + print('Queried {0} reads in {1}.'.format(totalreadcounter, os.path.basename(bam))) + + #Reorganize convs_split into convs as it is without multiprocessing + convs = {} #{readid : dictionary of all conversions} + for splitconv in convs_split: + convs.update(splitconv) + + #cleanup + for splitbam in splitbams: + os.remove(splitbam) + + return convs + + + + + +if __name__ == '__main__': + #convs, readcounter = iteratereads_pairedend(sys.argv[1], True, True, True, True, True, True, True, 1, 255, 30, None, None, 'high') + convs, readcounter = iteratereads_singleend(sys.argv[1], True, True, True, True, 1, 60, 30, None, None, 'high') + summarize_convs(convs, sys.argv[2]) + + diff --git a/src/pigpen/getmismatches_MPRA.py b/src/pigpen/getmismatches_MPRA.py new file mode 100644 index 0000000..51be1bc --- /dev/null +++ b/src/pigpen/getmismatches_MPRA.py @@ -0,0 +1,460 @@ +import pysam +import os +import re +import sys +from collections import defaultdict +import numpy as np +import pandas as pd +import subprocess +import argparse + + +def revcomp(nt): + revcompdict = { + 'G' : 'C', + 'C' : 'G', + 'A' : 'T', + 'T' : 'A', + 'N' : 'N', + 'g' : 'c', + 'c' : 'g', + 'a' : 't', + 't' : 'a', + 'n' : 'n', + 'X': 'X', + None: None, + 'NA': 'NA' + } + + nt_rc = revcompdict[nt] + + return nt_rc + + +def read_pair_generator(bam, region_string=None): + """ + Generate read pairs in a BAM file or within a region string. + Reads are added to read_dict until a pair is found. 
+ https://www.biostars.org/p/306041/ + """ + read_dict = defaultdict(lambda: [None, None]) + for read in bam: + if not read.is_proper_pair or read.is_secondary or read.is_supplementary or read.mate_is_unmapped or read.is_unmapped: + continue + qname = read.query_name + if qname not in read_dict: + if read.is_read1: + read_dict[qname][0] = read + else: + read_dict[qname][1] = read + else: + if read.is_read1: + yield read, read_dict[qname][1] + else: + yield read_dict[qname][0], read + del read_dict[qname] + + +def iteratereads_pairedend(bam, onlyConsiderOverlap, use_read1, use_read2, nConv, minMappingQual, minPhred=30): + #Iterate over reads in a paired end alignment file. + #Find nt conversion locations for each read. + #For locations interrogated by both mates of read pair, conversion must exist in both mates in order to count + #Store the number of conversions for each read in a dictionary + + #Quality score array is always in the same order as query_sequence, which is always on the + strand + #Bam must contain MD tags + + if onlyConsiderOverlap == 'True': + onlyConsiderOverlap = True + elif onlyConsiderOverlap == 'False': + onlyConsiderOverlap = False + + queriednts = [] + readcounter = 0 + convs = {} # {oligoname : [readcount, {dictionary of all conversions}] + save = pysam.set_verbosity(0) + with pysam.AlignmentFile(bam, 'r') as infh: + print('Finding nucleotide conversions in {0}...'.format( + os.path.basename(bam))) + for read1, read2 in read_pair_generator(infh): + + #Just double check that the pairs are matched + if read1.query_name != read2.query_name: + continue + + if read1.reference_name != read2.reference_name: + continue + + #Check mapping quality + #MapQ is 255 for uniquely aligned reads FOR STAR ONLY + #MapQ is (I think) >= 2 for uniquely aligned reads from bowtie2 + if read1.mapping_quality < minMappingQual or read2.mapping_quality < minMappingQual: + continue + + readcounter += 1 + if readcounter % 1000000 == 0: + print('Finding nucleotide 
conversions in read {0}...'.format( + readcounter)) + + oligo = read1.reference_name + read1queryseq = read1.query_sequence + read1alignedpairs = read1.get_aligned_pairs(with_seq=True) + if read1.is_reverse: + read1strand = '-' + elif not read1.is_reverse: + read1strand = '+' + + read2queryseq = read2.query_sequence + read2alignedpairs = read2.get_aligned_pairs(with_seq=True) + if read2.is_reverse: + read2strand = '-' + elif not read2.is_reverse: + read2strand = '+' + + read1qualities = list(read1.query_qualities) # phred scores + read2qualities = list(read2.query_qualities) + + convs_in_read = getmismatches_pairedend(read1alignedpairs, read2alignedpairs, read1queryseq, read2queryseq, read1qualities, read2qualities, read1strand, read2strand, onlyConsiderOverlap, nConv, minPhred, use_read1, use_read2) + queriednts.append(sum(convs_in_read.values())) + + #Add to our running dictionary + if oligo not in convs: + convs[oligo] = [1, convs_in_read] + elif oligo in convs: + readcount = convs[oligo][0] + readcount +=1 + oligodict = convs[oligo][1] + for conv in convs_in_read: + convcount = convs_in_read[conv] + oligodict[conv] += convcount + convs[oligo] = [readcount, oligodict] + + return convs + + +def getmismatches_pairedend(read1alignedpairs, read2alignedpairs, read1queryseq, read2queryseq, read1qualities, read2qualities, read1strand, read2strand, onlyoverlap, nConv, minPhred, use_read1, use_read2, masklocations=None): + #remove tuples that have None + #These are either intronic or might have been soft-clipped + #Tuples are (querypos, refpos, refsequence) + #If there is a substitution, refsequence is lower case + + #For now, forget insertions. Get rid of any position where reference position is None.
+ read1alignedpairs = [x for x in read1alignedpairs if x[1] != None] + read2alignedpairs = [x for x in read2alignedpairs if x[1] != None] + read1alignedpairs = [x for x in read1alignedpairs if x[2] != None] + read2alignedpairs = [x for x in read2alignedpairs if x[2] != None] + + #Add quality scores and query sequences + #will now be (querypos, refpos, refsequence, querysequence, qualscore) + for x in range(len(read1alignedpairs)): + alignedpair = read1alignedpairs[x] + querypos = alignedpair[0] + if querypos != None: + querynt = read1queryseq[querypos] + qualscore = read1qualities[querypos] + elif querypos == None: + querynt = 'X' # there is no query nt for a deletion + qualscore = 37 # there's no query position here, so make up a quality score + alignedpair = alignedpair + (querynt, qualscore) + read1alignedpairs[x] = alignedpair + + for x in range(len(read2alignedpairs)): + alignedpair = read2alignedpairs[x] + querypos = alignedpair[0] + if querypos != None: + querynt = read2queryseq[querypos] + qualscore = read2qualities[querypos] + elif querypos == None: + querynt = 'X' + qualscore = 37 + alignedpair = alignedpair + (querynt, qualscore) + read2alignedpairs[x] = alignedpair + + #if we have locations to mask, remove their locations from read1alignedpairs and read2alignedpairs + #masklocations is a set of 0-based coordinates of snp locations to mask + if masklocations: + read1alignedpairs = [x for x in read1alignedpairs if x[1] not in masklocations] + read2alignedpairs = [x for x in read2alignedpairs if x[1] not in masklocations] + + convs = {} #counts of conversions x_y where x is reference sequence and y is query sequence + + possibleconvs = [ + 'a_a', 'a_t', 'a_c', 'a_g', 'a_n', + 'g_a', 'g_t', 'g_c', 'g_g', 'g_n', + 'c_a', 'c_t', 'c_c', 'c_g', 'c_n', + 't_a', 't_t', 't_c', 't_g', 't_n', + 'a_x', 'g_x', 'c_x', 't_x', 'ng_xg'] + + #initialize dictionary + for conv in possibleconvs: + convs[conv] = 0 + + #For locations interrogated by both mates of read pair, 
conversion must exist in both mates in order to count + #These locations (as defined by their reference positions) would be found both in read1alignedpairs and read2alignedpairs + + #Get the ref positions queried by the two reads + r1dict = {} #{reference position: [queryposition, reference sequence, querysequence, quality]} + r2dict = {} + for x in read1alignedpairs: + r1dict[int(x[1])] = [x[0], x[2].upper(), x[3].upper(), x[4]] + for x in read2alignedpairs: + r2dict[int(x[1])] = [x[0], x[2].upper(), x[3].upper(), x[4]] + + # {refpos : [R1querypos, R2querypos, R1refsequence, R2refsequence, R1querysequence, R2querysequence, R1quality, R2quality]} + mergedalignedpairs = {} + #For positions only in R1 or R2, querypos and refsequence are NA for the other read + for refpos in r1dict: + r1querypos = r1dict[refpos][0] + r1refseq = r1dict[refpos][1] + r1queryseq = r1dict[refpos][2] + r1quality = r1dict[refpos][3] + if refpos in mergedalignedpairs: #this should not be possible because we are looking at r1 first + r2querypos = mergedalignedpairs[refpos][1] + r2refseq = mergedalignedpairs[refpos][3] + r2queryseq = mergedalignedpairs[refpos][5] + r2quality = mergedalignedpairs[refpos][7] + mergedalignedpairs[refpos] = [r1querypos, r2querypos, r1refseq, + r2refseq, r1queryseq, r2queryseq, r1quality, r2quality] + else: + mergedalignedpairs[refpos] = [r1querypos, 'NA', r1refseq, 'NA', r1queryseq, 'NA', r1quality, 'NA'] + + for refpos in r2dict: + #same thing + r2querypos = r2dict[refpos][0] + r2refseq = r2dict[refpos][1] + r2queryseq = r2dict[refpos][2] + r2quality = r2dict[refpos][3] + if refpos in mergedalignedpairs: #if we saw it for r1 + r1querypos = mergedalignedpairs[refpos][0] + r1refseq = mergedalignedpairs[refpos][2] + r1queryseq = mergedalignedpairs[refpos][4] + r1quality = mergedalignedpairs[refpos][6] + mergedalignedpairs[refpos] = [r1querypos, r2querypos, r1refseq, + r2refseq, r1queryseq, r2queryseq, r1quality, r2quality] + else: + mergedalignedpairs[refpos] = 
['NA', r2querypos, 'NA', r2refseq, 'NA', r2queryseq, 'NA', r2quality] + + #If we are only using read1 or only using read2, replace the positions in the non-used read with NA + for refpos in mergedalignedpairs: + r1querypos, r2querypos, r1refseq, r2refseq, r1queryseq, r2queryseq, r1quality, r2quality = mergedalignedpairs[refpos] + if use_read1 and not use_read2: + updatedlist = [r1querypos, 'NA', r1refseq, + 'NA', r1queryseq, 'NA', r1quality, 'NA'] + mergedalignedpairs[refpos] = updatedlist + elif use_read2 and not use_read1: + updatedlist = ['NA', r2querypos, 'NA', + r2refseq, 'NA', r2queryseq, 'NA', r2quality] + mergedalignedpairs[refpos] = updatedlist + elif not use_read1 and not use_read2: + print('ERROR: we must use read1, read2, or both.') + sys.exit() + elif use_read1 and use_read2: + pass + + #Now go through mergedalignedpairs, looking for conversions. + #For positions observed both in r1 and r2, queryseq in both reads must match, otherwise the position is skipped. + #We are now keeping track of deletions as either g_x (reference G, query deletion) or ng_xg (ref nt 5' of G deleted in query) + #We have observed that sometimes RT skips the nucleotide *after* an oxidized G ("after" being from the RT's point of view) + + for refpos in mergedalignedpairs: + conv = None + conv2 = None #sometimes we can have 2 convs (for example the first nt of ng_xg could be both g_x and ng_xg) + r1querypos = mergedalignedpairs[refpos][0] + r2querypos = mergedalignedpairs[refpos][1] + r1refseq = mergedalignedpairs[refpos][2] + r2refseq = mergedalignedpairs[refpos][3] + r1queryseq = mergedalignedpairs[refpos][4] + r2queryseq = mergedalignedpairs[refpos][5] + r1quality = mergedalignedpairs[refpos][6] + r2quality = mergedalignedpairs[refpos][7] + + if r1querypos != 'NA' and r2querypos == 'NA': #this position queried by r1 only + if read1strand == '-': + #refseq needs to equal the sense strand (it is always initially defined as the + strand).
read1 is always the sense strand. + r1refseq = revcomp(r1refseq) + r1queryseq = revcomp(r1queryseq) + + #If reference is N, skip this position + if r1refseq == 'N' or r1refseq == 'n': + continue + conv = r1refseq.lower() + '_' + r1queryseq.lower() + + if r1queryseq == 'X': + #Check if there is a reference G downstream of this position + if read1strand == '+': + downstreamrefpos = refpos + 1 + #It's possible that downstreamrefpos is not in mergedalignedpairs because this position is at the end of the read + try: + downstreamrefseq = mergedalignedpairs[downstreamrefpos][2].upper() + downstreamqueryseq = mergedalignedpairs[downstreamrefpos][4].upper() + except KeyError: + downstreamrefseq, downstreamqueryseq = None, None + elif read1strand == '-': + downstreamrefpos = refpos - 1 + try: + downstreamrefseq = mergedalignedpairs[downstreamrefpos][2].upper() + downstreamqueryseq = mergedalignedpairs[downstreamrefpos][4].upper() + except KeyError: + downstreamrefseq, downstreamqueryseq = None, None + downstreamrefseq = revcomp(downstreamrefseq) + downstreamqueryseq = revcomp(downstreamqueryseq) + if downstreamrefseq == 'G': + conv2 = 'ng_xg' + #If this is a non-g deletion and is downstream of a g, we can't be sure if this deletion is due to this nucleotide or the downstream g + if conv in ['a_x', 't_x', 'c_x']: + conv = None + + #Add conv(s) to dictionary + if r1quality >= minPhred and onlyoverlap == False: + # there will be some conversions (e.g. a_x that are not in convs) + if conv in convs: + convs[conv] +=1 + if conv2 == 'ng_xg' and conv != 'g_x': + convs[conv2] +=1 + + elif r1querypos == 'NA' and r2querypos != 'NA': #this position is queried by r2 only + if read1strand == '-': + # reference seq is independent of which read we are talking about + #Read1 is always the sense strand. r1queryseq and r2queryseq are always + strand + #The reference sequence is always on the + strand. 
+ #If read1 is on the - strand, we have already flipped reference seq (see a few lines above). + #We need to flip read2queryseq so that it is also - strand. + r2refseq = revcomp(r2refseq) + r2queryseq = revcomp(r2queryseq) + + if r2refseq == 'N' or r2refseq == 'n': + continue + + conv = r2refseq.lower() + '_' + r2queryseq.lower() + if r2queryseq == 'X': + #Check if there is a reference G downstream of this position + if read1strand == '+': + downstreamrefpos = refpos + 1 + try: + downstreamrefseq = mergedalignedpairs[downstreamrefpos][2].upper() + downstreamqueryseq = mergedalignedpairs[downstreamrefpos][4].upper() + except KeyError: + downstreamrefseq, downstreamqueryseq = None, None + elif read1strand == '-': + downstreamrefpos = refpos - 1 + try: + downstreamrefseq = mergedalignedpairs[downstreamrefpos][2].upper() + downstreamqueryseq = mergedalignedpairs[downstreamrefpos][4].upper() + except KeyError: + downstreamrefseq, downstreamqueryseq = None, None + downstreamrefseq = revcomp(downstreamrefseq) + downstreamqueryseq = revcomp(downstreamqueryseq) + if downstreamrefseq == 'G': + conv2 = 'ng_xg' + if conv in ['a_x', 't_x', 'c_x']: + conv = None + + #Add conv(s) to dictionary + if r2quality >= minPhred and onlyoverlap == False: + if conv in convs: + convs[conv] +=1 + if conv2 == 'ng_xg' and conv != 'g_x': + convs[conv2] +=1 + + elif r1querypos != 'NA' and r2querypos != 'NA': #this position is queried by both reads + if read1strand == '-': + r1refseq = revcomp(r1refseq) + r2refseq = revcomp(r2refseq) + r1queryseq = revcomp(r1queryseq) + r2queryseq = revcomp(r2queryseq) + + if r1refseq == 'N' or r2refseq == 'N' or r1refseq == 'n' or r2refseq == 'n': + continue + + r1result = r1refseq.lower() + '_' + r1queryseq.lower() + r2result = r2refseq.lower() + '_' + r2queryseq.lower() + + #Only record if r1 and r2 agree about what is going on + if r1result == r2result: + conv = r1refseq.lower() + '_' + r1queryseq.lower() + if r1queryseq == 'X': + #Check if there is a reference
G downstream of this position + if read1strand == '+': + downstreamrefpos = refpos + 1 + try: + downstreamrefseq = mergedalignedpairs[downstreamrefpos][2].upper() + downstreamqueryseq = mergedalignedpairs[downstreamrefpos][4].upper() + except KeyError: + downstreamrefseq, downstreamqueryseq = None, None + elif read1strand == '-': + downstreamrefpos = refpos - 1 + try: + downstreamrefseq = mergedalignedpairs[downstreamrefpos][2].upper() + downstreamqueryseq = mergedalignedpairs[downstreamrefpos][4].upper() + except KeyError: + downstreamrefseq, downstreamqueryseq = None, None + downstreamrefseq = revcomp(downstreamrefseq) + downstreamqueryseq = revcomp(downstreamqueryseq) + if downstreamrefseq == 'G': + conv2 = 'ng_xg' + if conv in ['a_x', 't_x', 'c_x']: + conv = None + + #Add conv(s) to dictionary + #Only do conv2 (ng_xg) if conv is not g_x + if r1quality >= minPhred and r2quality >= minPhred: + if conv in convs: + convs[conv] +=1 + if conv2 == 'ng_xg' and conv != 'g_x': + convs[conv2] +=1 + + elif r1querypos == 'NA' and r2querypos == 'NA': #if we are using only read1 or read2, it's possible for this position in both reads to be NA + continue + + #Does the number of t_c conversions meet our threshold? 
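The strand handling above reduces to a small rule: on the minus strand, complement both the reference and query nucleotides back to the sense strand before building the `ref_query` label. The sketch below isolates that rule; `conv_label` and `REVCOMP` are illustrative names for this editor's example, not functions from the source.

```python
#Editor's sketch (hypothetical helper, not in the source) of how a
#conversion label like 'g_t' is derived after strand correction.
#'X' marks a deletion in the query, as in the source.
REVCOMP = {'A': 'T', 'T': 'A', 'G': 'C', 'C': 'G', 'N': 'N', 'X': 'X'}

def conv_label(refnt, querynt, strand):
    #On the minus strand, flip both nucleotides to the sense strand first
    if strand == '-':
        refnt = REVCOMP[refnt]
        querynt = REVCOMP[querynt]
    return refnt.lower() + '_' + querynt.lower()

#A plus-strand C->A mismatch on a minus-strand transcript is really a
#sense-strand G->T conversion.
conv_label('C', 'A', '-')  # 'g_t'
```

This is why a reference C with a deleted base on a minus-strand read is recorded as `g_x`: the sense-strand reference nucleotide is a G.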
+ if convs['t_c'] >= nConv: + pass + elif convs['t_c'] < nConv: + convs['t_c'] = 0 + + return convs + +def writeOutput(convs, outfile): + #Write conv dict in text output table + # + possibleconvs = [ + 'a_a', 'a_t', 'a_c', 'a_g', 'a_n', + 'g_a', 'g_t', 'g_c', 'g_g', 'g_n', + 'c_a', 'c_t', 'c_c', 'c_g', 'c_n', + 't_a', 't_t', 't_c', 't_g', 't_n'] + + headerlist = ['oligo', 'readcount'] + possibleconvs + with open(outfile, 'w') as outfh: + outfh.write(('\t').join(headerlist) + '\n') + for oligo in convs: + readcount = convs[oligo][0] + outfh.write(oligo + '\t' + str(readcount) + '\t') + for possibleconv in possibleconvs: + v = str(convs[oligo][1][possibleconv]) + if possibleconv != 't_n': + outfh.write(v + '\t') + elif possibleconv == 't_n': + outfh.write(v + '\n') + +if __name__ == '__main__': + parser = argparse.ArgumentParser(description = 'Count mismatches in MPRA sequencing data.') + parser.add_argument('--bam', type = str, help = 'Alignment file. Ideally from bowtie2.') + parser.add_argument('--onlyConsiderOverlap', action='store_true', + help='Only consider conversions seen in both reads of a read pair? Only possible with paired end data.') + parser.add_argument('--use_read1', action='store_true', + help='Use read1 when looking for conversions? Only useful with paired end data.') + parser.add_argument('--use_read2', action='store_true', + help='Use read2 when looking for conversions? Only useful with paired end data.') + parser.add_argument('--minMappingQual', type=int, + help='Minimum mapping quality for a read to be considered in conversion counting. bowtie2 unique mappers have MAPQ >=2.', required=True) + parser.add_argument('--minPhred', type = int, help = 'Minimum phred score for a nucleotide to be considered.') + parser.add_argument( + '--nConv', type=int, help='Minimum number of required T->C conversions in a read pair in order for those conversions to be counted. 
Default is 1.', default=1) + parser.add_argument('--output', type = str, help = 'Output file.') + args = parser.parse_args() + + convs = iteratereads_pairedend(args.bam, args.onlyConsiderOverlap, args.use_read1, args.use_read2, args.nConv, args.minMappingQual, args.minPhred) + writeOutput(convs, args.output) + + + diff --git a/src/pigpen/maskpositions.py b/src/pigpen/maskpositions.py new file mode 100644 index 0000000..f787fff --- /dev/null +++ b/src/pigpen/maskpositions.py @@ -0,0 +1,32 @@ +#Given a bed of positions to mask, extract them and store them in a dictionary +#so they can be given to getmismatches.py. +#Beds are 0-based half-open, and we want to give getmismatches 0-based coordinates for easy +#interfacing with pysam. + +def readmaskbed(maskbed): + maskdict = {} #{chrm : (set of positions to mask)} + with open(maskbed, 'r') as infh: + for line in infh: + line = line.strip().split('\t') + chrm = line[0] + start = int(line[1]) + end = int(line[2]) + interval = list(range(start, end)) + if chrm not in maskdict: + maskdict[chrm] = [] + maskdict[chrm].extend(interval) + + + #Remove duplicates + for chrm in maskdict: + coords = set(maskdict[chrm]) + maskdict[chrm] = coords + + #Tell us how many positions we are masking + totalmask = 0 + for chrm in maskdict: + totalmask += len(maskdict[chrm]) + + print('Manually masking {0} positions.'.format(totalmask)) + + return maskdict \ No newline at end of file diff --git a/src/pigpen/parsebamreadcount.py b/src/pigpen/parsebamreadcount.py new file mode 100644 index 0000000..b5cbbdb --- /dev/null +++ b/src/pigpen/parsebamreadcount.py @@ -0,0 +1,56 @@ +import sys +import os +import argparse + +#Take the output of bam-readcount and parse to get a table of nt fractions at each position + +def parsebrc(bamreadcountout, covthresh, outfile): + d = {} #{chrm : {position : [ref, depth, Afrac, Tfrac, Gfrac, Cfrac]}} + + with open(bamreadcountout, 'r') as infh: + for line in infh: + line = line.strip().split('\t') + chrm = line[0] 
+ position = int(line[1]) + ref = line[2] + depth = int(line[3]) + if depth < covthresh: + continue + + if chrm not in d: + d[chrm] = {} + d[chrm][position] = [ref, depth] + ntfracs = {} #{nt : frac of reads} + for f in line[4:]: + fsplit = f.split(':') + nt = fsplit[0] + if nt not in ['A', 'C', 'T', 'G']: + continue + ntcount = int(fsplit[1]) + ntfrac = ntcount / depth + ntfrac = f'{ntfrac:.2e}' + ntfrac = float(ntfrac) + if nt == ref: + ntfrac = 'NA' + ntfracs[nt] = ntfrac + + d[chrm][position].extend([ntfracs['A'], ntfracs['T'], ntfracs['G'], ntfracs['C']]) + + with open(outfile, 'w') as outfh: + outfh.write(('\t').join(['chrm', 'position', 'ref', 'depth', 'Afrac', 'Tfrac', 'Gfrac', 'Cfrac']) + '\n') + for chrm in d: + for position in d[chrm]: + ref, depth, afrac, tfrac, gfrac, cfrac = d[chrm][position] + outfh.write(('\t').join([chrm, str(position), ref, str(depth), str(afrac), str(tfrac), str(gfrac), str(cfrac)]) + '\n') + + +if __name__ == '__main__': + parser = argparse.ArgumentParser() + parser.add_argument('--bamreadcountout', type = str, help = 'Output of bam-readcount.') + parser.add_argument('--mindepth', type = int, help = 'Minimum depth for a position to be considered.') + parser.add_argument('--outfile', type = str, help = 'Output file.') + args = parser.parse_args() + + parsebrc(args.bamreadcountout, args.mindepth, args.outfile) + + \ No newline at end of file diff --git a/src/pigpen/runpigpen.py b/src/pigpen/runpigpen.py new file mode 100644 index 0000000..f74e6fe --- /dev/null +++ b/src/pigpen/runpigpen.py @@ -0,0 +1,217 @@ +#!/usr/bin/env python + +#Pipeline for Identification of Guanosine Positions Erroneously Notated +#PIGPEN + +import argparse +import subprocess +import os +import sys +from pigpen.snps import getSNPs, recordSNPs +from pigpen.maskpositions import readmaskbed +from pigpen.getmismatches import iteratereads_singleend, iteratereads_pairedend, getmismatches +from pigpen.assignreads import getReadOverlaps, processOverlaps +from 
pigpen.conversionsPerGene import getPerGene, writeConvsPerGene + + +parser = argparse.ArgumentParser(description=' ,-,-----,\n PIGPEN **** \\ \\ ),)`-\'\n <`--\'> \\ \\` \n /. . `-----,\n OINC! > (\'\') , @~\n `-._, ___ /\n-|-|-|-|-|-|-|-| (( / (( / -|-|-| \n|-|-|-|-|-|-|-|- \'\'\' \'\'\' -|-|-|-\n-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|\n\n Pipeline for Identification \n Of Guanosine Positions\n Erroneously Notated', formatter_class = argparse.RawDescriptionHelpFormatter) +parser.add_argument('--datatype', type = str, choices = ['single', 'paired'], required = True, help = 'Single end or paired end data?') +parser.add_argument('--samplenames', type = str, help = 'Comma separated list of samples to quantify.', required = True) +parser.add_argument('--controlsamples', type = str, help = 'Comma separated list of control samples (i.e. those where no *induced* conversions are expected). May be a subset of samplenames. Required if SNPs are to be considered and a snpfile is not supplied.') +parser.add_argument('--gff', type = str, help = 'Genome annotation in gff format.') +parser.add_argument('--gfftype', type = str, help = 'Source of genome annotation file.', choices = ['GENCODE', 'Ensembl'], required = True) +parser.add_argument('--genomeFasta', type = str, help = 'Genome sequence in fasta format. Required if SNPs are to be considered.') +parser.add_argument('--nproc', type = int, help = 'Number of processors to use. Default is 1.', default = 1) +parser.add_argument('--useSNPs', action = 'store_true', help = 'Consider SNPs?') +parser.add_argument('--snpfile', type = str, help = 'VCF file of snps to mask. If --useSNPs but a --snpfile is not supplied, a VCF of snps will be created using --controlsamples.') +parser.add_argument('--maskbed', help = 'Optional. Bed file of positions to mask from analysis.', default = None) +parser.add_argument('--ROIbed', help = 'Optional. Bed file of specific regions of interest in which to quantify conversions. 
If supplied, only conversions in these regions will be quantified.', default = None) +parser.add_argument('--SNPcoverage', type = int, help = 'Minimum coverage to call SNPs. Default = 20', default = 20) +parser.add_argument('--SNPfreq', type = float, help = 'Minimum variant frequency to call SNPs. Default = 0.4', default = 0.4) +parser.add_argument('--onlyConsiderOverlap', action = 'store_true', help = 'Only consider conversions seen in both reads of a read pair? Only possible with paired end data.') +parser.add_argument('--use_g_t', action = 'store_true', help = 'Consider G->T conversions?') +parser.add_argument('--use_g_c', action = 'store_true', help = 'Consider G->C conversions?') +parser.add_argument('--use_g_x', action='store_true', help='Consider G->deletion conversions?') +parser.add_argument('--use_ng_xg', action='store_true', help='Consider NG->deletionG conversions?') +parser.add_argument('--use_read1', action = 'store_true', help = 'Use read1 when looking for conversions? Only useful with paired end data.') +parser.add_argument('--use_read2', action = 'store_true', help = 'Use read2 when looking for conversions? Only useful with paired end data.') +parser.add_argument('--minMappingQual', type = int, help = 'Minimum mapping quality for a read to be considered in conversion counting. STAR unique mappers have MAPQ 255.', required = True) +parser.add_argument('--minPhred', type = int, help = 'Minimum phred quality score for a base to be considered. Default = 30', default = 30) +parser.add_argument('--nConv', type = int, help = 'Minimum number of required G->T and/or G->C conversions in a read pair in order for those conversions to be counted. Default is 1.', default = 1) +parser.add_argument('--outputDir', type = str, help = 'Output directory.', required = True) +args = parser.parse_args() + +def main(): + #Run script from an entry point + + #What type of gff are we working with? 
+ if args.gfftype == 'GENCODE': + from pigpen.assignreads_salmon import getpostmasterassignments, assigntotxs, collapsetogene, readspergene, writeOutput + elif args.gfftype == 'Ensembl': + from pigpen.assignreads_salmon_ensembl import getpostmasterassignments, assigntotxs, collapsetogene, readspergene, writeOutput + + #If we have single end data, considering overlap of paired reads or only one read doesn't make sense + if args.datatype == 'single': + args.onlyConsiderOverlap = False + args.use_read1 = False + args.use_read2 = False + + #Store command line arguments + suppliedargs = {} + for arg in vars(args): + if arg != 'samplenames': + suppliedargs[arg] = getattr(args, arg) + + #Take in list of samplenames to run pigpen on + #Derive quant.sf, STAR bams, and postmaster bams + samplenames = args.samplenames.split(',') + salmonquants = [os.path.join(x, 'salmon', '{0}.quant.sf'.format(x)) for x in samplenames] + starbams = [os.path.join(x, 'STAR', '{0}Aligned.sortedByCoord.out.bam'.format(x)) for x in samplenames] #non-deduplicated bams + #starbams = [os.path.join(x, 'STAR', '{0}.dedup.bam'.format(x)) for x in samplenames] + postmasterbams = [os.path.join(x, 'postmaster', '{0}.postmaster.bam'.format(x)) for x in samplenames] + + #Take in list of control samples, make list of their corresponding star bams for SNP calling + if args.controlsamples: + controlsamples = args.controlsamples.split(',') + controlindicies = [] + samplebams = [] + controlsamplebams = [] + for ind, x in enumerate(samplenames): + samplebams.append(starbams[ind]) + if args.controlsamples and x in controlsamples: + controlsamplebams.append(starbams[ind]) + + #We have to be either looking for G->T or G->C, if not both + if not args.use_g_t and not args.use_g_c and not args.use_g_x: + print('We have to either be looking for G->T or G->C or G->del, if not both! 
Add argument --use_g_t and/or --use_g_c and/or --use_g_x.') + sys.exit() + + #We have to be using either read1 or read2 if not both + if not args.use_read1 and not args.use_read2 and args.datatype == 'paired': + print('We need to use read1 or read2, if not both! Add argument --use_read1 and/or --use_read2.') + sys.exit() + + #If we want to only consider overlap, we have to be using both read1 and read2 + if args.onlyConsiderOverlap and (not args.use_read1 or not args.use_read2): + print('If we are only going to consider overlap between paired reads, we must use both read1 and read2.') + sys.exit() + + #Make vcf file for snps + if args.snpfile: + snps = recordSNPs(args.snpfile) + if args.useSNPs and not args.snpfile and not args.controlsamples: + print('ERROR: If we want to consider snps we either have to give control samples for finding snps or a vcf file of snps we already know!') + sys.exit() + if args.useSNPs and not args.snpfile: + if not os.path.exists('snps'): + os.mkdir('snps') + vcfFileNames = getSNPs(controlsamplebams, args.genomeFasta, args.SNPcoverage, args.SNPfreq) + for f in vcfFileNames: + csi = f + '.csi' + log = f[:-3] + '.log' + #Move files to snps directory + os.rename(f, os.path.join('snps', f)) + os.rename(csi, os.path.join('snps', csi)) + os.rename(log, os.path.join('snps', log)) + + os.rename('merged.vcf', os.path.join('snps', 'merged.vcf')) + os.rename('vcfconcat.log', os.path.join('snps', 'vcfconcat.log')) + snps = recordSNPs(os.path.join('snps', 'merged.vcf')) + + elif not args.useSNPs and not args.snpfile: + snps = None + + #Get positions to manually mask if given + if args.maskbed: + print('Getting positions to manually mask...') + maskpositions = readmaskbed(args.maskbed) + elif not args.maskbed: + maskpositions = None + + #If there is no supplied bedfile of regions of interest, + #for each sample, identify conversions, assign conversions to transcripts, + #and collapse transcript-level measurements to gene-level measurements. 
+    if not args.ROIbed: +        for ind, sample in enumerate(samplenames): +            #Create parameter dictionary that is unique to this sample +            sampleparams = suppliedargs +            sampleparams['sample'] = sample + +            print('Running PIGPEN for {0}...'.format(sample)) + +            samplebam = samplebams[ind] +            sampleparams['samplebam'] = os.path.abspath(samplebam) +            if args.nproc == 1: +                if args.datatype == 'paired': +                    convs, readcounter = iteratereads_pairedend(samplebam, args.onlyConsiderOverlap, args.use_g_t, args.use_g_c, args.use_g_x, args.use_ng_xg, +                    args.use_read1, args.use_read2, args.nConv, args.minMappingQual, args.minPhred, snps, maskpositions, 'high') +                elif args.datatype == 'single': +                    convs, readcounter = iteratereads_singleend( +                        samplebam, args.use_g_t, args.use_g_c, args.use_g_x, args.use_ng_xg, args.nConv, args.minMappingQual, args.minPhred, snps, maskpositions, 'high') +            elif args.nproc > 1: +                convs = getmismatches(args.datatype, samplebam, args.onlyConsiderOverlap, snps, maskpositions, args.nConv, +                args.minMappingQual, args.nproc, args.use_g_t, args.use_g_c, args.use_g_x, args.use_ng_xg, args.use_read1, args.use_read2, args.minPhred) + +            print('Getting posterior probabilities from salmon alignment file...') +            postmasterbam = postmasterbams[ind] +            sampleparams['postmasterbam'] = os.path.abspath(postmasterbam) +            pprobs = getpostmasterassignments(postmasterbam) +            print('Assigning conversions to transcripts...') +            txconvs = assigntotxs(pprobs, convs) +            print('Collapsing transcript level conversion counts to gene level...') +            tx2gene, geneid2genename, geneconvs = collapsetogene(txconvs, args.gff) +            print('Counting number of reads assigned to each gene...') +            salmonquant = salmonquants[ind] +            sampleparams['salmonquant'] = os.path.abspath(salmonquant) +            genecounts = readspergene(salmonquant, tx2gene) +            print('Writing output...') +            if not os.path.exists(args.outputDir): +                os.mkdir(args.outputDir) +            outputfile = os.path.join(args.outputDir, sample + '.pigpen.txt') + 
writeOutput(sampleparams, geneconvs, genecounts, geneid2genename, outputfile, args.use_g_t, args.use_g_c, args.use_g_x, args.use_ng_xg) +            print('Done!') + +    #If there is a bed file of regions of interest supplied, then use that. Don't use the salmon/postmaster quantifications. +    elif args.ROIbed: +        #Make fasta index +        command = ['samtools', 'faidx', args.genomeFasta] +        subprocess.call(command) +        faidx = args.genomeFasta + '.fai' + +        #Create chrsort +        command = ['cut', '-f', '1,2', faidx] +        with open('chrsort.txt', 'w') as outfh: +            subprocess.run(command, stdout = outfh) + +        for ind, sample in enumerate(samplenames): +            #Create parameter dictionary that is unique to this sample +            sampleparams = suppliedargs +            sampleparams['sample'] = sample + +            print('Running PIGPEN for {0}...'.format(sample)) + +            samplebam = samplebams[ind] +            sampleparams['samplebam'] = os.path.abspath(samplebam) +            if args.nproc == 1: +                if args.datatype == 'paired': +                    convs, readcounter = iteratereads_pairedend(samplebam, args.onlyConsiderOverlap, args.use_g_t, args.use_g_c, args.use_g_x, args.use_ng_xg, +                    args.use_read1, args.use_read2, args.nConv, args.minMappingQual, args.minPhred, snps, maskpositions, 'high') +                elif args.datatype == 'single': +                    convs, readcounter = iteratereads_singleend( +                        samplebam, args.use_g_t, args.use_g_c, args.use_g_x, args.use_ng_xg, args.nConv, args.minMappingQual, args.minPhred, snps, maskpositions, 'high') + +            elif args.nproc > 1: +                convs = getmismatches(args.datatype, samplebam, args.onlyConsiderOverlap, snps, maskpositions, +                args.nConv, args.minMappingQual, args.nproc, args.use_g_t, args.use_g_c, args.use_g_x, args.use_ng_xg, args.use_read1, args.use_read2, args.minPhred) + +            print('Assigning reads to genes in supplied bed file...') +            overlaps, numpairs = getReadOverlaps(samplebam, args.ROIbed, 'chrsort.txt') +            read2gene = processOverlaps(overlaps, numpairs) +            numreadspergene, convsPerGene = getPerGene(convs, read2gene) +            if not os.path.exists(args.outputDir): + 
os.mkdir(args.outputDir) + outputfile = os.path.join(args.outputDir, sample + '.pigpen.txt') + writeConvsPerGene(sampleparams, numreadspergene, convsPerGene, outputfile, args.use_g_t, args.use_g_c, args.use_g_x, args.use_ng_xg) + +if __name__ == '__main__': + main() \ No newline at end of file diff --git a/src/pigpen/simulateOINCreads.py b/src/pigpen/simulateOINCreads.py new file mode 100644 index 0000000..52b013d --- /dev/null +++ b/src/pigpen/simulateOINCreads.py @@ -0,0 +1,134 @@ +#python >=3.6 +import random +import sys +import gzip + +#usage: python simulateOINCreads.py + +#This is the wildtype sequence of the amplicon. Paired end reads will read in from both ends. +seq = 'ACAGTCCATGCCATCACTGCCACCCAGAAGACTGTGGATGGCCCCTCCGGGAAACTGTGGCGTGATGGCCGCGGGGCTCTCCAGAACATCATCCCTGCCTCTACTGGCGCTGCCAAGGCTGTGGGCAAGGTCATCCCTGAGCTGAACGGGAAGCTCACTGGCATGGCCTTCCGTGTCCCCACTGCCAACGTGTCAGTGGTGGACCTGACCTGCCGTCTAGAAAAACCTGCCAAATATGATGACATCAAGAAGGTGGTGAAGCAGGCGTCGGAGGGCCCCCTCAAGGGCATCCTGGGCTACACTGAGCACCAGGTGGTCTCCTCTGACTTCAACAGCGACACCCACTCCTCCACCTTTGACGCTGGGGCTGGCATTGCCCTCAACGACCACTTTGTCAAGCTC' +#first 200 nt of above seq so that we can incorporate read overlap +seq = 'ACAGTCCATGCCATCACTGCCACCCAGAAGACTGTGGATGGCCCCTCCGGGAAACTGTGGCGTGATGGCCGCGGGGCTCTCCAGAACATCATCCCTGCCTCTACTGGCGCTGCCAAGGCTGTGGGCAAGGTCATCCCTGAGCTGAACGGGAAGCTCACTGGCATGGCCTTCCGTGTCCCCACTGCCAACGTGTCAGTGGT' + + +#Intrinsic (i.e. cell- or RT-derived) mutation rates +mutfreqs = {'A' : {'C' : 1e-5, 'T' : 0, 'G' : 2e-4}, +'C' : {'G' : 5e-5, 'T' : 1.5e-3, 'A' : 4e-4}, +'G' : {'C' : 3e-5, 'T' : 2e-6, 'A' : 8e-4}, +'T' : {'A' : 3e-6, 'C' : 7e-5, 'G' : 8e-6}} + +#Sequencing error rate +seqerrorrate = 0.001 + + +def revcomp(nt): + revcompdict = { + 'G': 'C', + 'C': 'G', + 'A': 'T', + 'T': 'A', + 'N': 'N', + 'g': 'c', + 'c': 'g', + 'a': 't', + 't': 'a', + 'n': 'n' + } + + nt_rc = revcompdict[nt] + + return nt_rc + +def makecDNAseq(wtseq, mutfreqs): + #Make the sequence of the cDNA for this read pair. 
#This is intended to be the entire fragment. We will break it up +    #into the read pairs (and incorporate sequencing errors) later. +    #build sequence nt by nt +    outseq = '' +    for nt in wtseq: +        possiblents = list(mutfreqs[nt].keys()) +        possiblentfreqs = list(mutfreqs[nt].values()) +        wtfreq = 1 - sum(possiblentfreqs) +        #add chance nt will be wt +        possiblents.append(nt) +        possiblentfreqs.append(wtfreq) + +        outnt = random.choices( +            population = possiblents, +            weights = possiblentfreqs, +            k = 1 +        ) +        outnt = outnt[0] +        outseq += outnt + +    return outseq + +def addseqerrors(readseq, seqerrorrate): +    #Given a read sequence, add simulated sequencing errors +    outseq = '' +    for nt in readseq: +        #is there a sequencing error at this position? +        possiblents = list(mutfreqs[nt].keys()) +        possiblentfreqs = [seqerrorrate / 3] * 3 +        #add chance of not having a sequencing error +        possiblents.append(nt) +        possiblentfreqs.append(1 - seqerrorrate) + +        outnt = random.choices( +            population = possiblents, +            weights = possiblentfreqs, +            k = 1 +        ) +        outnt = outnt[0] +        outseq += outnt + +    return outseq + +def makefastq(seq, mutfreqs, seqerrorrate, readlength, depth, outfile): +    #Given a wildtype amplicon (seq), paired end read length, desired depth, make simulated reads +    #with desired mutation and sequencing error rates +    #all quality scores are J (41) +    readlength = int(readlength) +    depth = int(depth) + +    with gzip.open(outfile + '_1.fq.gz', 'wt') as read1outfh, gzip.open(outfile + '_2.fq.gz', 'wt') as read2outfh: +        readcounter = 0 +        for i in range(depth): +            readcounter += 1 +            if readcounter % 100000 == 0: +                print('Creating read {0}...'.format(readcounter)) + +            #Make fragment for this readpair +            fragseq = makecDNAseq(seq, mutfreqs) +            #Make reads from this fragment +            read1seq = fragseq[:readlength] +            read2seq = fragseq[readlength * -1:] +            #reverse complement read2 +            read2seq_rc = '' +            for nt in read2seq: +                nt_rc = revcomp(nt) +                read2seq_rc += nt_rc +            read2seq_rc = read2seq_rc[::-1] + + 
#Add sequencing errors + read1seq = addseqerrors(read1seq, seqerrorrate) + read2seq = addseqerrors(read2seq_rc, seqerrorrate) + + qualityscores = 'J' * len(read1seq) + readtitle = '@simread_' + str(readcounter) + + read1outfh.write(readtitle + '\n' + read1seq + '\n' + '+' + '\n' + qualityscores + '\n') + read2outfh.write(readtitle + '\n' + read2seq + '\n' + '+' + '\n' + qualityscores + '\n') + + with open('simulationparams.txt', 'w') as outfh: + outfh.write(('\t').join(['ref', 'mut', 'freq']) + '\n') + for ref in mutfreqs: + for mut in mutfreqs[ref]: + freq = str(mutfreqs[ref][mut]) + outfh.write(('\t').join([ref, mut, freq]) + '\n') + + +makefastq(seq, mutfreqs, seqerrorrate, sys.argv[1], sys.argv[2], 'oincsimulation') + + + diff --git a/snps.py b/src/pigpen/snps.py similarity index 92% rename from snps.py rename to src/pigpen/snps.py index 6374962..ee9a102 100644 --- a/snps.py +++ b/src/pigpen/snps.py @@ -9,16 +9,11 @@ import sys #This will take in a list of bams and identify variants, creating vcf files for each -def getSNPs(bams, genomefasta, minCoverage = 20, minVarFreq = 0.02): +def getSNPs(bams, genomefasta, minCoverage = 20, minVarFreq = 0.2): if not minCoverage: minCoverage = 20 if not minVarFreq: - minVarFreq = 0.02 - - #if we already made a vcf, don't make another one - if os.path.exists('merged.vcf'): - print('A merged vcf files already exists! 
Not making another one...') - return None + minVarFreq = 0.2 vcfFileNames = [] @@ -53,11 +48,10 @@ def getSNPs(bams, genomefasta, minCoverage = 20, minVarFreq = 0.02): with open('vcfconcat.log', 'w') as logfh: vcfFileNames = vcfFileNames * 2 vcfFiles = ' '.join(vcfFileNames) - print(vcfFiles) - concatCMD = 'bcftools merge --force-samples -m snps -O z --output merged.vcf ' + vcfFiles + concatCMD = 'bcftools merge --force-samples -m snps -O z --filter-logic x --output merged.vcf ' + vcfFiles concat = subprocess.Popen(concatCMD, shell = True, stderr = logfh) concat.wait() - filetorecord = vcfFileNames[0] + vcfFileNames = [vcfFileNames[0]] elif len(vcfFileNames) > 1: with open('vcfconcat.log', 'w') as logfh: vcfFiles = ' '.join(vcfFileNames) diff --git a/toy.bam b/toy.bam deleted file mode 100644 index 18b41f7..0000000 Binary files a/toy.bam and /dev/null differ diff --git a/workflow/config.yaml b/workflow/config.yaml new file mode 100755 index 0000000..a630652 --- /dev/null +++ b/workflow/config.yaml @@ -0,0 +1,41 @@ +samples: + ["ATP5MC1_Rep1_pDBF","ATP5MC1_Rep1_mDBF"] + +libtype: "LEXO" # "SA" or "LEXO" +dedupUMI: True # True or False +threads: 32 # Number of threads +STARindex: "/beevol/home/goeringr/Projects/OINC_seq/SSIV_tx_compare/LEXO/STARindex" +Salmonindex: "/beevol/home/goeringr/Projects/OINC_seq/SSIV_tx_compare/LEXO/transcripts32.idx" + +pigpen: + # path to files + pigpen: "/beevol/home/goeringr/Projects/OINC_seq/test/py2test/pigpen.py" + gff: "/beevol/home/goeringr/Annotations/hg38/gencode.v32.annotation.gff3.gz" + genomeFASTA: "/beevol/home/goeringr/Annotations/hg38/GRCh38.p13.genome.fa" + snpfile: "" + maskbed: "" + ROIbed: "" + # comma separated string + controlsamples: "ATP5MC1_Rep1_mDBF" + # parameter values + SNPcoverage: "20" + SNPfreq: "0.2" + nconv: "1" + minMappingQual: "60" + # output directory name + outputDir: "PIGPEN" + # space separated string + tags: "--useSNPs --use_g_t --use_g_c --onlyConsiderOverlap --use_read1 --use_read2 --dedupUMI" + 
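
The `pigpen:` block above is consumed by the Snakemake rules in `workflow/rules/`, which turn each optional entry into a command-line flag and collapse unset entries to empty strings. A minimal Python sketch of that pattern (the keys and values below are illustrative, not the full config):

```python
# Mirror of the flag-building pattern used in the Snakemake rules:
# an unset (empty) config value contributes nothing to the command line,
# a set value contributes " --flag value".
def optional_flag(flag, value):
    """Return ' --flag value' if value is truthy, else ''."""
    return f" --{flag} {value}" if value else ""

# Illustrative subset of the pigpen: config block.
pigpen_cfg = {"snpfile": "", "maskbed": "", "SNPcoverage": "20", "SNPfreq": "0.2"}

cmd = "python -u pigpen.py"
for flag in ["snpfile", "maskbed", "SNPcoverage", "SNPfreq"]:
    cmd += optional_flag(flag, pigpen_cfg[flag])

print(cmd)  # unset entries (snpfile, maskbed) add no flags
```

This is why the rule `params` below are built as ternaries like `VCF=" --snpfile " + config["pigpen"]["snpfile"] if config["pigpen"]["snpfile"] else ""` and then simply concatenated into the shell command.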
+bacon: + # path to files + bacon: "/beevol/home/goeringr/Projects/OINC_seq/test/py2test/bacon_glm.py" + sampconds: + ["/beevol/home/goeringr/Projects/OINC_seq/OINC_MAVS/bacon_MAVS.txt"] + minreads: "" + conditionA: "mDBF" + conditionB: "pDBF" + tags: "--use_g_t --use_g_c --considernonG" + output: + ["MAVS.bacon.txt"] + diff --git a/workflow/rules/B2FQ.smk b/workflow/rules/B2FQ.smk new file mode 100644 index 0000000..f2788b9 --- /dev/null +++ b/workflow/rules/B2FQ.smk @@ -0,0 +1,48 @@ +""" +Rules for writing bam to paired fq files +For usage, include this in your workflow. +""" + + +if not config["dedupUMI"] and config["libtype"]=="LEXO": + rule STARbam2FQ: + """converts STAR aligned bam to paired fq""" + input: + bam=expand("PIGPEN_alignments/{sample}/STAR/{sample}Aligned.sortedByCoord.out.bam", sample=config["samples"]), + bai=expand("PIGPEN_alignments/{sample}/STAR/{sample}Aligned.sortedByCoord.out.bam.bai", sample=config["samples"]), + output: + sortedbam=temp(expand("PIGPEN_alignments/{sample}/STAR/{sample}.temp.namesort.bam", sample=config["samples"])), + fq1=temp(expand("PIGPEN_alignments/{sample}/STAR/{sample}.STARaligned.r1.fq.gz", sample=config["samples"])), + fq2=temp(expand("PIGPEN_alignments/{sample}/STAR/{sample}.STARaligned.r2.fq.gz", sample=config["samples"])), + threads: config["threads"] + run: + for bam,sortedbam,fq1,fq2 in zip(input.bam,output.sortedbam,output.fq1,output.fq2): + shell( + "samtools collate --threads {threads} -u -o {sortedbam} {bam}" + ) + shell( + "samtools fastq --threads {threads} -1 {fq1} -2 {fq2} " + "-0 /dev/null -s /dev/null -n {sortedbam}" + ) + +elif config["dedupUMI"] and config["libtype"]=="LEXO": + rule Dedupbam2FQ: + """converts deduplicated STAR bam to paired fq""" + input: + bam=expand("PIGPEN_alignments/{sample}/STAR/{sample}.dedup.bam", sample=config["samples"]), + bai=expand("PIGPEN_alignments/{sample}/STAR/{sample}.dedup.bam.bai", sample=config["samples"]), + output: + 
sortedbam=temp(expand("PIGPEN_alignments/{sample}/STAR/{sample}.temp.namesort.bam", sample=config["samples"])), + fq1=temp(expand("PIGPEN_alignments/{sample}/STAR/{sample}.dedup.r1.fq.gz", sample=config["samples"])), + fq2=temp(expand("PIGPEN_alignments/{sample}/STAR/{sample}.dedup.r2.fq.gz", sample=config["samples"])), + threads: config["threads"] + run: + for bam,sortedbam,fq1,fq2 in zip(input.bam,output.sortedbam,output.fq1,output.fq2): + shell( + "samtools collate --threads {threads} -u -o {sortedbam} {bam}" + ) + shell( + "samtools fastq --threads {threads} -1 {fq1} -2 {fq2} " + "-0 /dev/null -s /dev/null -n {sortedbam}" + ) + diff --git a/workflow/rules/STARalign.smk b/workflow/rules/STARalign.smk new file mode 100644 index 0000000..dab1ddb --- /dev/null +++ b/workflow/rules/STARalign.smk @@ -0,0 +1,29 @@ +""" +Rules for genome alignment with STAR +For usage, include this in your workflow. +""" + +rule runSTAR: + """aligns trimmed reads to genome with STAR""" + input: + read1=expand("cutadapt/{sample}_1.trimmed.fq.gz", sample=config["samples"]), + read2=expand("cutadapt/{sample}_2.trimmed.fq.gz", sample=config["samples"]), + STARindex=config["STARindex"], + params: + outprefix=expand("PIGPEN_alignments/{sample}/STAR/{sample}", sample=config["samples"]), + output: + bam=expand("PIGPEN_alignments/{sample}/STAR/{sample}Aligned.sortedByCoord.out.bam", sample=config["samples"]), + bai=expand("PIGPEN_alignments/{sample}/STAR/{sample}Aligned.sortedByCoord.out.bam.bai", sample=config["samples"]), + threads: config["threads"] + run: + for r1,r2,prefix,bam in zip(input.read1,input.read2,params.outprefix,output.bam): + shell( + "STAR --runMode alignReads --runThreadN {threads} " + "--genomeLoad NoSharedMemory --genomeDir {input.STARindex} " + "--readFilesIn {r1} {r2} --readFilesCommand zcat " + "--outFileNamePrefix {prefix} --outSAMtype BAM " + "SortedByCoordinate --outSAMstrandField intronMotif " + "--outSAMattributes MD NH --outSAMmultNmax 1" + ) + shell("samtools 
index {bam}") + diff --git a/workflow/rules/UMIdedup.smk b/workflow/rules/UMIdedup.smk new file mode 100644 index 0000000..9399551 --- /dev/null +++ b/workflow/rules/UMIdedup.smk @@ -0,0 +1,26 @@ +""" +Rules for UMI deduplication with UMI-tools +For usage, include this in your workflow. +""" + +if config["dedupUMI"]: +    rule dedupSTAR: +        """deduplicates UMI-tagged reads in STAR bams with UMI-tools""" +        input: +            bam=expand("PIGPEN_alignments/{sample}/STAR/{sample}Aligned.sortedByCoord.out.bam", sample=config["samples"]), +            bai=expand("PIGPEN_alignments/{sample}/STAR/{sample}Aligned.sortedByCoord.out.bam.bai", sample=config["samples"]), + +        output: +            bam=expand("PIGPEN_alignments/{sample}/STAR/{sample}.dedup.bam", sample=config["samples"]), +            bai=expand("PIGPEN_alignments/{sample}/STAR/{sample}.dedup.bam.bai", sample=config["samples"]), +        run: +            for Inbam,Outbam in zip(input.bam,output.bam): +                if config["libtype"]=="LEXO": +                    shell("umi_tools dedup -I {Inbam} --paired -S {Outbam}") +                    shell("samtools index {Outbam}") +                elif config["libtype"]=="SA": +                    shell("umi_tools dedup -I {Inbam} --paired --method=unique -S {Outbam}") +                    shell("samtools index {Outbam}") +                else: +                    print('LibType must be either "LEXO" or "SA".') + diff --git a/workflow/rules/UMIextract.smk b/workflow/rules/UMIextract.smk new file mode 100644 index 0000000..30e4914 --- /dev/null +++ b/workflow/rules/UMIextract.smk @@ -0,0 +1,32 @@ +""" +Rules for extraction of UMIs with UMI_tools +For usage, include this in your workflow. 
+""" + +rule UMIextract: + """extracts UMIs from raw fastq files""" + input: + read1=expand("RAWREADS/{sample}.R1.fq.gz", sample=config["samples"]), + read2=expand("RAWREADS/{sample}.R2.fq.gz", sample=config["samples"]), + output: + read1=expand("UMI_fastq/{sample}_R1.fq.gz", sample=config["samples"]), + read2=expand("UMI_fastq/{sample}_R2.fq.gz", sample=config["samples"]), + run: + if not config["libtype"]: + print("libtype must be included in the config file") + elif config["libtype"]: + for r1,r2,o1,o2 in zip(input.read1,input.read2,output.read1,output.read2): + if config["libtype"]=="LEXO": + shell( + "umi_tools extract -I {r1} " + "--bc-pattern=NNNNNN --read2-in={r2} " + "--stdout={o1} --read2-out={o2}" + ) + elif config["libtype"]=="SA": + shell( + "umi_tools extract -I {r2} " + "--bc-pattern=NNNNNNNNNNNN --read2-in={r1} " + "--stdout={o2} --read2-out={o1}" + ) + else: + print("libtype must be either 'LEXO' or 'SA'") diff --git a/workflow/rules/runBacon.smk b/workflow/rules/runBacon.smk new file mode 100644 index 0000000..ebef2a2 --- /dev/null +++ b/workflow/rules/runBacon.smk @@ -0,0 +1,27 @@ +""" +Rules for running bacon.py +For usage, include this in your workflow. 
+""" + +rule runBacon: +    """runs Bacon with desired sample comparisons""" +    input: +        bacon=config["bacon"]["bacon"], +        pigpen=expand("PIGPEN_alignments/{outDir}/{sample}.pigpen.txt", sample=config["samples"], outDir =config["pigpen"]["outputDir"]), +    params: +        SC=expand("{sampconds}", sampconds=config["bacon"]["sampconds"]), +        MR=" --minreads " + config["bacon"]["minreads"] if config["bacon"]["minreads"] else "", +        CA=" --conditionA " + config["bacon"]["conditionA"] if config["bacon"]["conditionA"] else "", +        CB=" --conditionB " + config["bacon"]["conditionB"] if config["bacon"]["conditionB"] else "", +        OT=" " + config["bacon"]["tags"] if config["bacon"]["tags"] else "", +        OD=expand("PIGPEN_alignments/{outDir}/{output}", output=config["bacon"]["output"], outDir =config["pigpen"]["outputDir"]), +    output: +        expand("PIGPEN_alignments/{outDir}/{output}", output=config["bacon"]["output"], outDir =config["pigpen"]["outputDir"]), +    threads: config["threads"] +    run: +        for sampcond,output in zip(params.SC,params.OD): +            shell( +                "python -u {input.bacon} --sampconds {sampcond} " +                "--output {output}{params.MR}{params.CA}{params.CB}{params.OT}" +            ) + diff --git a/workflow/rules/runPM.smk b/workflow/rules/runPM.smk new file mode 100644 index 0000000..5f922c5 --- /dev/null +++ b/workflow/rules/runPM.smk @@ -0,0 +1,24 @@ +""" +Rules for alignment probability extraction with postmaster +For usage, include this in your workflow. 
+""" + +if config["libtype"]=="LEXO": +    rule runPostmaster: +        """extracts alignment probabilities from salmon bams and quantification""" +        input: +            sf=expand("PIGPEN_alignments/{sample}/salmon/{sample}.quant.sf", sample=config["samples"]), +            bam=expand("PIGPEN_alignments/{sample}/salmon/{sample}.salmon.bam", sample=config["samples"]), +        output: +            bam=expand("PIGPEN_alignments/{sample}/postmaster/{sample}.postmaster.bam", sample=config["samples"]), +            bai=expand("PIGPEN_alignments/{sample}/postmaster/{sample}.postmaster.bam.bai", sample=config["samples"]), +        threads: config["threads"] +        run: +            for sf,Sbam,PMbam in zip(input.sf,input.bam,output.bam): +                shell( +                    "postmaster --num-threads {threads} --quant {sf} " +                    "--alignments {Sbam} --output {PMbam}" +                ) +                shell("samtools sort -@ {threads} -o {PMbam} {PMbam}") +                shell("samtools index {PMbam}") + diff --git a/workflow/rules/runPigpen.smk b/workflow/rules/runPigpen.smk new file mode 100644 index 0000000..9b18564 --- /dev/null +++ b/workflow/rules/runPigpen.smk @@ -0,0 +1,138 @@ +""" +Rules for running pigpen.py +For usage, include this in your workflow. 
+""" + +if config["libtype"]=="LEXO" and config["dedupUMI"]: +    rule runPigpen_LEXO_UMI: +        """runs Pigpen with desired parameters""" +        input: +            pigpen=config["pigpen"]["pigpen"], +            dedupbam=expand("PIGPEN_alignments/{sample}/STAR/{sample}.dedup.bam", sample=config["samples"]), +            dedupbai=expand("PIGPEN_alignments/{sample}/STAR/{sample}.dedup.bam.bai", sample=config["samples"]), +            PMbam=expand("PIGPEN_alignments/{sample}/postmaster/{sample}.postmaster.bam", sample=config["samples"]), +            PMbai=expand("PIGPEN_alignments/{sample}/postmaster/{sample}.postmaster.bam.bai", sample=config["samples"]), +        params: +            samples=",".join(config["samples"]), +            gff=" --gff " + config["pigpen"]["gff"] if config["pigpen"]["gff"] else "", +            fa=" --genomeFasta " + config["pigpen"]["genomeFASTA"] if config["pigpen"]["genomeFASTA"] else "", +            CS=" --controlsamples " + config["pigpen"]["controlsamples"] if config["pigpen"]["controlsamples"] else "", +            VCF=" --snpfile " + config["pigpen"]["snpfile"] if config["pigpen"]["snpfile"] else "", +            MB=" --maskbed " + config["pigpen"]["maskbed"] if config["pigpen"]["maskbed"] else "", +            RB=" --ROIbed " + config["pigpen"]["ROIbed"] if config["pigpen"]["ROIbed"] else "", +            SC=" --SNPcoverage " + config["pigpen"]["SNPcoverage"] if config["pigpen"]["SNPcoverage"] else "", +            SF=" --SNPfreq " + config["pigpen"]["SNPfreq"] if config["pigpen"]["SNPfreq"] else "", +            NC=" --nConv " + config["pigpen"]["nconv"] if config["pigpen"]["nconv"] else "", +            MQ=" --minMappingQual " + config["pigpen"]["minMappingQual"] if config["pigpen"]["minMappingQual"] else "", +            OT=" " + config["pigpen"]["tags"] if config["pigpen"]["tags"] else "", +            OD=" --outputDir " + config["pigpen"]["outputDir"] if config["pigpen"]["outputDir"] else "", +        output: +            expand("PIGPEN_alignments/{outDir}/{sample}.pigpen.txt", sample=config["samples"], outDir =config["pigpen"]["outputDir"]), +        threads: config["threads"] +        run: +            os.chdir('PIGPEN_alignments') +            shell( +                "python -u {input.pigpen} 
--samplenames {params.samples} --nproc {threads}" +                "{params.gff}{params.fa}{params.CS}{params.VCF}{params.MB}{params.RB}" +                "{params.SC}{params.SF}{params.NC}{params.MQ}{params.OT}{params.OD}" +            ) + +elif config["libtype"]=="LEXO" and not config["dedupUMI"]: +    rule runPigpen_LEXO: +        """runs Pigpen with desired parameters""" +        input: +            pigpen=config["pigpen"]["pigpen"], +            STARbam=expand("PIGPEN_alignments/{sample}/STAR/{sample}Aligned.sortedByCoord.out.bam", sample=config["samples"]), +            STARbai=expand("PIGPEN_alignments/{sample}/STAR/{sample}Aligned.sortedByCoord.out.bam.bai", sample=config["samples"]), +            PMbam=expand("PIGPEN_alignments/{sample}/postmaster/{sample}.postmaster.bam", sample=config["samples"]), +            PMbai=expand("PIGPEN_alignments/{sample}/postmaster/{sample}.postmaster.bam.bai", sample=config["samples"]), +        params: +            samples=",".join(config["samples"]), +            gff=" --gff " + config["pigpen"]["gff"] if config["pigpen"]["gff"] else "", +            fa=" --genomeFasta " + config["pigpen"]["genomeFASTA"] if config["pigpen"]["genomeFASTA"] else "", +            CS=" --controlsamples " + config["pigpen"]["controlsamples"] if config["pigpen"]["controlsamples"] else "", +            VCF=" --snpfile " + config["pigpen"]["snpfile"] if config["pigpen"]["snpfile"] else "", +            MB=" --maskbed " + config["pigpen"]["maskbed"] if config["pigpen"]["maskbed"] else "", +            RB=" --ROIbed " + config["pigpen"]["ROIbed"] if config["pigpen"]["ROIbed"] else "", +            SC=" --SNPcoverage " + config["pigpen"]["SNPcoverage"] if config["pigpen"]["SNPcoverage"] else "", +            SF=" --SNPfreq " + config["pigpen"]["SNPfreq"] if config["pigpen"]["SNPfreq"] else "", +            NC=" --nConv " + config["pigpen"]["nconv"] if config["pigpen"]["nconv"] else "", +            MQ=" --minMappingQual " + config["pigpen"]["minMappingQual"] if config["pigpen"]["minMappingQual"] else "", +            OT=" " + config["pigpen"]["tags"] if config["pigpen"]["tags"] else "", +            OD=" --outputDir " + config["pigpen"]["outputDir"] if config["pigpen"]["outputDir"] else "", +        output: 
expand("PIGPEN_alignments/{outDir}/{sample}.pigpen.txt", sample=config["samples"], outDir =config["pigpen"]["outputDir"]), + threads: config["threads"] + run: + os.chdir('PIGPEN_alignments') + shell( + "python -u {input.pigpen} --samplenames {params.samples} --nproc {threads}" + "{params.gff}{params.fa}{params.CS}{params.VCF}{params.MB}{params.RB}" + "{params.SC}{params.SF}{params.NC}{params.MQ}{params.OT}{params.OD}" + ) + +elif config["libtype"]=="SA" and config["dedupUMI"]: + rule runPigpen_SA_UMI: + """runs Pigpen with desired parameters""" + input: + pigpen=config["pigpen"]["pigpen"], + dedupbam=expand("PIGPEN_alignments/{sample}/STAR/{sample}.dedup.bam", sample=config["samples"]), + dedupbai=expand("PIGPEN_alignments/{sample}/STAR/{sample}.dedup.bam.bai", sample=config["samples"]), + params: + samples=",".join(config["samples"]), + gff=" --gff " + config["pigpen"]["gff"] if config["pigpen"]["gff"] else "", + fa=" --genomeFasta " + config["pigpen"]["genomeFASTA"] if config["pigpen"]["genomeFASTA"] else "", + CS=" --controlsamples " + config["pigpen"]["controlsamples"] if config["pigpen"]["controlsamples"] else "", + VCF=" --snpfile " + config["pigpen"]["snpfile"] if config["pigpen"]["snpfile"] else "", + MB=" --maskbed " + config["pigpen"]["maskbed"] if config["pigpen"]["maskbed"] else "", + RB=" --ROIbed " + config["pigpen"]["ROIbed"] if config["pigpen"]["ROIbed"] else "", + SC=" --SNPcoverage " + config["pigpen"]["SNPcoverage"] if config["pigpen"]["SNPcoverage"] else "", + SF=" --SNPfreq " + config["pigpen"]["SNPfreq"] if config["pigpen"]["SNPfreq"] else "", + NC=" --nConv " + config["pigpen"]["nconv"] if config["pigpen"]["nconv"] else "", + MQ=" --minMappingQual " + config["pigpen"]["minMappingQual"] if config["pigpen"]["minMappingQual"] else "", + OT=" " + config["pigpen"]["tags"] if config["pigpen"]["tags"] else "", + OD=" --outputDir " + config["pigpen"]["outputDir"] if config["pigpen"]["outputDir"] else "", + output: + 
expand("PIGPEN_alignments/{outDir}/{sample}.pigpen.txt", sample=config["samples"], outDir =config["pigpen"]["outputDir"]), + threads: config["threads"] + run: + os.chdir('PIGPEN_alignments') + shell( + "python -u {input.pigpen} --samplenames {params.samples} --nproc {threads}" + "{params.gff}{params.fa}{params.CS}{params.VCF}{params.MB}{params.RB}" + "{params.SC}{params.SF}{params.NC}{params.MQ}{params.OT}{params.OD}" + ) + +elif config["libtype"]=="SA" and not config["dedupUMI"]: + rule runPigpen_SA: + """runs Pigpen with desired parameters""" + input: + pigpen=config["pigpen"]["pigpen"], + STARbam=expand("PIGPEN_alignments/{sample}/STAR/{sample}Aligned.sortedByCoord.out.bam", sample=config["samples"]), + STARbai=expand("PIGPEN_alignments/{sample}/STAR/{sample}Aligned.sortedByCoord.out.bam.bai", sample=config["samples"]), + params: + samples=",".join(config["samples"]), + gff=" --gff " + config["pigpen"]["gff"] if config["pigpen"]["gff"] else "", + fa=" --genomeFasta " + config["pigpen"]["genomeFASTA"] if config["pigpen"]["genomeFASTA"] else "", + CS=" --controlsamples " + config["pigpen"]["controlsamples"] if config["pigpen"]["controlsamples"] else "", + VCF=" --snpfile " + config["pigpen"]["snpfile"] if config["pigpen"]["snpfile"] else "", + MB=" --maskbed " + config["pigpen"]["maskbed"] if config["pigpen"]["maskbed"] else "", + RB=" --ROIbed " + config["pigpen"]["ROIbed"] if config["pigpen"]["ROIbed"] else "", + SC=" --SNPcoverage " + config["pigpen"]["SNPcoverage"] if config["pigpen"]["SNPcoverage"] else "", + SF=" --SNPfreq " + config["pigpen"]["SNPfreq"] if config["pigpen"]["SNPfreq"] else "", + NC=" --nConv " + config["pigpen"]["nconv"] if config["pigpen"]["nconv"] else "", + MQ=" --minMappingQual " + config["pigpen"]["minMappingQual"] if config["pigpen"]["minMappingQual"] else "", + OT=" " + config["pigpen"]["tags"] if config["pigpen"]["tags"] else "", + OD=" --outputDir " + config["pigpen"]["outputDir"] if config["pigpen"]["outputDir"] else "", + output: 
+ expand("PIGPEN_alignments/{outDir}/{sample}.pigpen.txt", sample=config["samples"], outDir=config["pigpen"]["outputDir"]), + threads: config["threads"] + run: + os.chdir('PIGPEN_alignments') + shell( + "python -u {input.pigpen} --samplenames {params.samples} --nproc {threads}" + "{params.gff}{params.fa}{params.CS}{params.VCF}{params.MB}{params.RB}" + "{params.SC}{params.SF}{params.NC}{params.MQ}{params.OT}{params.OD}" + ) + + diff --git a/workflow/rules/runSalmon.smk b/workflow/rules/runSalmon.smk new file mode 100644 index 0000000..63ffb1b --- /dev/null +++ b/workflow/rules/runSalmon.smk @@ -0,0 +1,48 @@ +""" +Rules for transcriptome alignment with Salmon +For usage, include this in your workflow. +""" + +if not config["dedupUMI"] and config["libtype"]=="LEXO": + rule runSalmon: + """aligns trimmed reads to transcriptome with Salmon""" + input: + read1=expand("PIGPEN_alignments/{sample}/STAR/{sample}.STARaligned.r1.fq.gz", sample=config["samples"]), + read2=expand("PIGPEN_alignments/{sample}/STAR/{sample}.STARaligned.r2.fq.gz", sample=config["samples"]), + Salmonindex=config["Salmonindex"], + params: + samplename=expand("PIGPEN_alignments/{sample}/salmon", sample=config["samples"]), + output: + sf=expand("PIGPEN_alignments/{sample}/salmon/{sample}.quant.sf", sample=config["samples"]), + bam=temp(expand("PIGPEN_alignments/{sample}/salmon/{sample}.salmon.bam", sample=config["samples"])), + threads: config["threads"] + run: + for r1,r2,samplename,sf,bam in zip(input.read1,input.read2,params.samplename,output.sf,output.bam): + shell( + "salmon quant --libType A -p {threads} --seqBias --gcBias " + "--validateMappings -1 {r1} -2 {r2} -o {samplename} " + "--index {input.Salmonindex} --writeMappings={bam} --writeQualities" + ) + shell("mv {samplename}/quant.sf {sf}") + +elif config["dedupUMI"] and config["libtype"]=="LEXO": + rule runSalmonDedup: + """aligns trimmed reads to transcriptome with Salmon""" + input: + read1=expand("PIGPEN_alignments/{sample}/STAR/{sample}.dedup.r1.fq.gz", sample=config["samples"]), + 
read2=expand("PIGPEN_alignments/{sample}/STAR/{sample}.dedup.r2.fq.gz", sample=config["samples"]), + Salmonindex=config["Salmonindex"], + params: + samplename=expand("PIGPEN_alignments/{sample}/salmon", sample=config["samples"]), + output: + sf=expand("PIGPEN_alignments/{sample}/salmon/{sample}.quant.sf", sample=config["samples"]), + bam=temp(expand("PIGPEN_alignments/{sample}/salmon/{sample}.salmon.bam", sample=config["samples"])), + threads: config["threads"] + run: + for r1,r2,samplename,sf,bam in zip(input.read1,input.read2,params.samplename,output.sf,output.bam): + shell( + "salmon quant --libType A -p {threads} --seqBias --gcBias " + "--validateMappings -1 {r1} -2 {r2} -o {samplename} " + "--index {input.Salmonindex} --writeMappings={bam} --writeQualities" + ) + shell("mv {samplename}/quant.sf {sf}") diff --git a/workflow/rules/trimming.smk b/workflow/rules/trimming.smk new file mode 100644 index 0000000..cd3b486 --- /dev/null +++ b/workflow/rules/trimming.smk @@ -0,0 +1,101 @@ +""" +Rules for trimming NGS reads with cutadapt +(http://cutadapt.readthedocs.org/en/latest/guide.html#illumina-truseq) +For usage, include this in your workflow. + +Quantseq samples +Read1, trim AAAAAAAAAAAAAAAAAAAA from 3' end +Read2, trim AGATCGGAAGAGCGTCGTGTAGGGAAAGACGGTA from 3' end and TTTTTTTTTTTTTTTTTTTT from 5' end + +NEW QUANTSEQ STRATEGY +Step1: remove the random hexamer binding site (first 6 nt of read 1) and trim 3' adapter off read 1 (-u 6 ; -a AAAAAAAAAAAAAAAAAAAA) [make temporary output files] +Step2: Trim 5' adapter of read 2 (TTTTTTTTTTTTTTTTTTTT) [make temporary output files] +Step3: Try to trim 3' adapter of read 2 (AGATCGGAAGAGCGTCGTGTAGGGAAAGACGGTA). Write untrimmed reads (these are done). [make temporary outfile for trimmed reads] +Step4: For reads that did have 3' adapter on read 2, remove the last 6 bases on read 2 (UMI). These are now done too. +Step5: Combine trimmed reads from step4 with untrimmed reads from step3. 
+ +SINGLE AMPLICON (SA) STRATEGY +Step1: get rid of primer binding sites (20nt) + internal barcode (6nt) +""" +if not config["libtype"]: + print("libtype must be included in the config file") + +elif config["libtype"]=="LEXO": + rule cutadapt_LEXO: + """Trims given paired-end reads with given parameters""" + input: + read1=expand("UMI_fastq/{sample}_R1.fq.gz", sample=config["samples"]), + read2=expand("UMI_fastq/{sample}_R2.fq.gz", sample=config["samples"]), + output: + outread1s1=temp(expand("cutadapt/{sample}_1.temp.step1.fq.gz", sample=config["samples"])), + outread2s1=temp(expand("cutadapt/{sample}_2.temp.step1.fq.gz", sample=config["samples"])), + statsouts1=expand("cutadapt/{sample}.cutadaptstats.step1.txt", sample=config["samples"]), + outread1s2=temp(expand("cutadapt/{sample}_1.temp.step2.fq.gz", sample=config["samples"])), + outread2s2=temp(expand("cutadapt/{sample}_2.temp.step2.fq.gz", sample=config["samples"])), + statsouts2=expand("cutadapt/{sample}.cutadaptstats.step2.txt", sample=config["samples"]), + outread1s3=temp(expand("cutadapt/{sample}_1.temp.step3.fq.gz", sample=config["samples"])), + outread2s3=temp(expand("cutadapt/{sample}_2.temp.step3.fq.gz", sample=config["samples"])), + statsouts3=expand("cutadapt/{sample}.cutadaptstats.step3.txt", sample=config["samples"]), + untrimmedr1s3=temp(expand("cutadapt/{sample}_1.untrimmed.step3.fq.gz", sample=config["samples"])), + untrimmedr2s3=temp(expand("cutadapt/{sample}_2.untrimmed.step3.fq.gz", sample=config["samples"])), + outread1s4=temp(expand("cutadapt/{sample}_1.temp.step4.fq.gz", sample=config["samples"])), + outread2s4=temp(expand("cutadapt/{sample}_2.temp.step4.fq.gz", sample=config["samples"])), + statsouts4=expand("cutadapt/{sample}.cutadaptstats.step4.txt", sample=config["samples"]), + finaloutread1=expand("cutadapt/{sample}_1.trimmed.fq.gz", sample=config["samples"]), + finaloutread2=expand("cutadapt/{sample}_2.trimmed.fq.gz", sample=config["samples"]), + threads: config["threads"] + run: + 
for Ir1,Ir2,O1r1,O1r2,S1,O2r1,O2r2,S2,O3r1,O3r2,S3,U3r1,U3r2,O4r1,O4r2,S4,OFr1,OFr2 in zip(input.read1,input.read2,output.outread1s1,output.outread2s1,output.statsouts1,output.outread1s2,output.outread2s2,output.statsouts2,output.outread1s3,output.outread2s3,output.statsouts3,output.untrimmedr1s3,output.untrimmedr2s3,output.outread1s4,output.outread2s4,output.statsouts4,output.finaloutread1,output.finaloutread2): + #Step1 + shell( + "cutadapt -u 6 -U 0 -a AAAAAAAAAAAAAAAAAAAA --minimum-length 25 " + "-j {threads} -o {O1r1} -p {O1r2} {Ir1} {Ir2} > {S1}" + ) + + #Step2 + shell( + "cutadapt -G TTTTTTTTTTTTTTTTTTTT --minimum-length 25 " + "-j {threads} -o {O2r1} -p {O2r2} {O1r1} {O1r2} > {S2}" + ) + + #Step3 + shell( + "cutadapt -A AGATCGGAAGAGCGTCGTGTAGGGAAAGACGGTA --minimum-length 25 " + "-j {threads} --untrimmed-output {U3r1} " + "--untrimmed-paired-output {U3r2} -o {O3r1} -p {O3r2} " + "{O2r1} {O2r2} > {S3}" + ) + + #Step4 + shell( + "cutadapt -U -6 --minimum-length 25 -j {threads} -o {O4r1} " + "-p {O4r2} {O3r1} {O3r2} > {S4}" + ) + + #Step5 + shell("cat {U3r1} {O4r1} > {OFr1}") + shell("cat {U3r2} {O4r2} > {OFr2}") + + +elif config["libtype"]=="SA": + rule cutadapt_SA: + """Trims given paired-end reads with given parameters""" + input: + read1=expand("UMI_fastq/{sample}_R1.fq.gz", sample=config["samples"]), + read2=expand("UMI_fastq/{sample}_R2.fq.gz", sample=config["samples"]), + output: + finaloutread1=expand("cutadapt/{sample}_1.trimmed.fq.gz", sample=config["samples"]), + finaloutread2=expand("cutadapt/{sample}_2.trimmed.fq.gz", sample=config["samples"]), + statsouts=expand("cutadapt/{sample}.cutadaptstats.txt", sample=config["samples"]), + threads: config["threads"] + run: + for Ir1,Ir2,S1,OFr1,OFr2 in zip(input.read1,input.read2,output.statsouts,output.finaloutread1,output.finaloutread2): + shell( + "cutadapt -u 26 -U 26 --minimum-length 25 " + "-j {threads} -o {OFr1} -p {OFr2} {Ir1} {Ir2} > {S1}" + ) + + +else: + print("libtype must be either 'LEXO' or 
'SA'") + diff --git a/workflow/snakefile b/workflow/snakefile new file mode 100644 index 0000000..2179294 --- /dev/null +++ b/workflow/snakefile @@ -0,0 +1,17 @@ +configfile: "config/config.yaml" + +include: "rules/UMIextract.smk" +include: "rules/trimming.smk" +include: "rules/STARalign.smk" +include: "rules/UMIdedup.smk" +include: "rules/B2FQ.smk" +include: "rules/runSalmon.smk" +include: "rules/runPM.smk" +include: "rules/runPigpen.smk" +include: "rules/runBacon.smk" + +rule all: + input: + #expand("PIGPEN_alignments/{outDir}/{sample}.pigpen.txt", sample=config["samples"], outDir=config["pigpen"]["outputDir"]), + expand("PIGPEN_alignments/{outDir}/{sample}.bacon.txt", sample=config["bacon"]["output"], outDir=config["pigpen"]["outputDir"]), +
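
The rules above pull everything from `config/config.yaml` (samples, threads, library type, and the nested `pigpen`/`bacon` sections). A minimal sketch of that file, inferred only from the keys referenced in this diff, follows — every value is an illustrative placeholder, not a shipped default, and the empty entries show that optional keys can be left blank to omit the corresponding command-line flag:

```yaml
# Hypothetical config/config.yaml sketch; keys are taken from the rules in
# this diff, values are placeholders.
samples:
  - ctrl_rep1
  - ox_rep1
threads: 8
libtype: "LEXO"          # "LEXO" or "SA"
dedupUMI: true
Salmonindex: "indexes/salmon_index"
pigpen:
  pigpen: "scripts/pigpen.py"
  gff: "annotations/genes.gff"
  genomeFASTA: "genome/genome.fa"
  controlsamples: "ctrl_rep1"
  snpfile: ""            # optional VCF of SNPs to exclude
  maskbed: ""            # optional BED of regions to mask
  ROIbed: ""             # optional BED of regions of interest
  SNPcoverage: "20"      # quoted: the rules concatenate these into the command string
  SNPfreq: "0.2"
  nconv: "2"
  minMappingQual: "255"
  tags: ""               # extra flags passed through verbatim
  outputDir: "pigpen_out"
bacon:
  output:
    - ox_vs_ctrl
```

With a config like this in place, the workflow would be launched in the usual Snakemake way, e.g. `snakemake -s workflow/snakefile --cores 8`.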