Codon Usage Bias from RNA-sequencing data (CUBseq) is a fully automatic pipeline that produces robust estimates of codon usage frequencies at the transcriptome level. CUBseq can be used for any organism with an NCBI taxonomy ID, available RNA-sequencing data and a reference genome/annotation. The end result is a dataset of transcriptome-wide sequences with variants built in, allowing CUBseq to provide codon relative frequencies as well as raw counts at codon and amino acid resolution for custom downstream codon usage analysis.
- Large-scale transcriptome-wide codon usage analysis.
- Generation of transcriptome-derived codon usage tables (expressed as relative frequency and frequency per thousand).
- Quantification of transcriptome-wide genes.
- Robust identification of high expression genes.
- Reconstruction of transcriptomes per sample using variant calls.
- Analysis of mutation frequency per sample across the transcriptome and at gene level.
- Comparison of codon frequency with tRNA abundance.
Note
Before running the workflow, you will need to have Nextflow installed. See instructions on how to here.
nextflow pull stracquadaniolab/cubseq-nf -r mainA nextflow.config configuration file will need to be created where parameters are defined, as specified below in Configuring CUBseq. This configuration file will need to be created in the same directory where the pipeline will be run. An example configuration file is provided in example-nextflow.config.
Assuming the configuration file is set, to run CUBseq, the bare minimum command required is:
nextflow run stracquadaniolab/cubseq-nf -r main -profile singularity -c conf/nextflow.configAlternatively, you can define parameters and call custom profiles (examples available on example-nextflow.config) directly in the nextflow run command:
nextflow run stracquadaniolab/cubseq-nf -r main -profile singularity,cell -c conf/nextflow.config --resultsDir ./results/test-runFor example, here we call a profile, cell, which we defined in our config file (which we used to specify the executor, RAM/CPU requirements and error strategy for each process). We also specify a custom results directory path to save output files to.
To run CUBseq you will need to specify a number of paths for storing
results, and provide appropriate parameter options based on the
organism being analysed. These parameters need to be defined in a
configuration file called nextflow.config. Required parameters are
indicated with an asterisk, the rest of the parameters are optional.
| Parameter | Description |
|---|---|
resultsDir |
Directory where all results are stored [default: "./results/"]. |
| Paths to genome files | |
genome.reference * |
Path to genome reference (fasta) file [example: "data/genome/ecoli.fa"]. |
genome.annotation * |
Path to genome annotation (GTF/GFF/GFF3) file [example: "data/genome/ecoli.gff"]. |
| ENA metadata retrieval parameters | |
taxonId * |
NCBI taxonomy ID of organism to be analysed [default: "562"]. |
limitSearch |
Limit number of records output from ENA search query [default: 0]. |
removeRun |
Remove run by specifying its run accession [default: "NULL", example: "SRR13894889"]. |
max_sra_bytes |
Specify runs to remove if they exceed size of sra_bytes [default: "55000000000"]. |
dateMin |
Set minimum date (YYYY/MM/DD) to filter runs by (inclusive) [default: "1950-01-01"]. |
dateMax |
Set maximum date (YYYY/MM/DD) to filter runs by (inclusive), uses current date by default [default: "FALSE"]. |
| STAR align parameters | |
star.sjdbOverhang |
The "--sjdbOverhang" option of STAR, specifies length of genomic sequence on each side of the junctions, refer to STAR documentation for more detail. Here, we use STAR's default option [default: "100"]. |
star.genomeSAindexNbases * |
The "--genomeSAindexNbases" option of STAR, specifying the length (bases) of SA pre-indexing string. This must be scaled down for small genomes, using formula: min(14, log2(GenomeLength)/2 - 1). [default: "10"]. |
star.alignIntronMax |
The "--alignIntronMax" option of STAR, specifying maximum intron size [default: "1".] |
star.limitBAMsortRAM |
The "--limitBAMsortRAM" option of STAR, specifying maximum available RAM (bytes) [default: "2342750981"]. |
star.outBAMsortingBinsN |
The "--outBAMsortingBinsN" option of STAR, specifying the number of genome bins for coordinate-sorting [default: "50"]. |
| featureCounts parameters | |
featureCounts.type.feature |
The "-t" option of featureCounts, specifying feature type(s) in a GTF annotation to be used for read mapping. Multiple types should be separated by "," with no space in between [default: "exon"]. |
featureCounts.type.attribute |
The "-g" option of featureCounts, specifying attribute type in the GTF annotation [default" "gene_id"]. |
| Freebayes parameters | |
freebayes.ploidy * |
The "--ploidy" option of Freebayes, specifying the default ploidy for the organism used in the analysis. [default: "1"]. |
freebayes.args |
Additional Freebayes arguments, refer to their documentation [default: ""]. |
| bcftools parameters | |
bcftools.filter_vcf.args |
Additional bcftools filter arguments for filtering the VCF file, refer to their documentation [default: 'QUAL>20 && TYPE="snp"', note the use of quotation marks here]. |
| Salmon indexing parameters | |
salmon.index.args |
Additional arguments for salmon indexing, refer to their documentation [default: ""]. |
| Salmon quantification parameters | |
salmon.quant.libtype |
The "--libType" option of Salmon quant, specifying library type, CUBseq sets this to "Automatic" detection by default. Refer to their documentation for more information [default: "A"]. |
salmon.quant.args |
Additional arguments for salmon quant, refer to their documentation [example: "--writeUnmappedNames"]. |
| tximport parameters | |
summarize_to_gene. counts_from_abundance |
Generate counts from abundances in tximport [default: "no"]. |
CUBseq results are stored in the following directories:
results/metadata/metadata.csv: file containing the ENA metadata of RNA sequencing runs.results/bams/: directory containing the bam files, as processed by STAR.results/featureCounts/: directory containing featureCounts gene quantification results per sample and summary statistics.results/freebayes-vcf/: directory containing vcf files, as processed by Freebayes.results/vcf/: directory containing filtered vcf files, as processed by bcftools norm and bcftools filter.results/transcriptome-consensus/: directory containing consensus transcriptomes in fasta format.results/wt-transcriptome/: directory containing the wild-type transcriptome, as generated by gffread.results/mut-transcriptome/: directory containing the reconstructed mutated transcriptomes per sequencing run, as processed by gffread.results/salmon-quant/: directory containing gene abundance results per sequencing run, as processed by salmon quantification.results/dataset/: directory containing the tximport RDS file that sumamrises salmon quantification results at the gene-level (expressed as TPM matrix).results/gene-rank-analysis/: directory containing results of CUBseq's gene rank analysis.results/heg-mut-transcriptome/: directory of fasta files per sequencing run, containing only highly expressed genes.results/protein-mut-transcriptome/: directory of fasta files per sequencing run, containing transcriptome-wide (i.e. all protein-coding) gemes.results/cu-data/: directory containing codon usage count data for highly expressed genes, protein coding genes, as well as from the Kazusa and CoCoPUTs databases (if available).results/summarise-codon-counts/: directory containing codon counts summarised at codon and amino acid resolution.
- Anima Sutradhar (A.Sutradhar@sms.ed.ac.uk): developer and maintainer.
- Giovanni Stracquadanio (giovanni.stracquadanio@ed.ac.uk): principal investigator.
If you have any questions, issues or feature requests, please get in touch using the emails above or posting an Issue.
