Transcriptomics

Qulaity control

Create the channel for qc to install fastp fastqc multiqc install

Step:1
Tool installation

# create env and install tools
$ conda create --yes -n qc fastp fastqc multiqc

# activate env
$ conda activate qc

Step: 2

conda create --yes -n qc fastp fastqc multiqc

----------------------------

vi run_qc.sh


#!/bin/bash

set -e


#--Activate the qc environment

source activate qc


#-- Check the reads quality

cd ~/analysis/015_seed_transcriptomics/01.quality_preprocessing/raw

fastqc -t 32 -o ~/analysis/015_seed_transcriptomics/01.quality_preprocessing/raw *.fastq.gz &> log.fastqc &

#---Trimming

#!/bin/bash

set -e

for reads in *_1.fastq.gz

do 

base=$(basename $reads _1.fastq.gz)

fastp --detect_adapter_for_pe \
       --overrepresentation_analysis \
       --correction --cut_right --thread 2 \
       --html ../trim/${base}.fastp.html --json ../trim/${base}.fastp.json \
       -i ${base}_1.fastq.gz -I ${base}_2.fastq.gz \
       -o ../trim/${base}_R1.fastq.gz -O ../trim/${base}_R2.fastq.gz 

done
#----Run
bash run_fastp.sh &>log.fastp &

#---Let's run fastqc again on the trimmed data

cd ../trim/

fastqc -t 32 *

#--Let's run multiQC on both untrimmed and trimmed files

cd ..

multiqc raw trim 

source deactivate qc

#--Bye


$ bash run_qc.sh &> log.qc &

Note:

--detect_adapter_for_pe: Specifies that we are dealing with paired-end data.

--overrepresentation_analysis: Analyse the sequence collection for sequences that appear too often.

--correction: Will try to correct bases based on an overlap analysis of read1 and read2.

--cut_right: Will use quality trimming and scan the read from start to end in a window. If the quality in the window is below what is required, the window plus all sequence towards the end is discarded and the read is kept if its still long enough.

--thread: Specify how many concurrent threads the process can use.

--html and --json: We specify the location of some stat files.

-i data/anc_R1.fastq.gz -I data/anc_R2.fastq.gz: Specifies the two input read files

-o trimmed/anc_R1.fastq.gz -O trimmed/anc_R2.fastq.gz: Specifies the two desired output read files

Blast2GO tutorial

Use this command if you have InterProScan result (make sure you edit the annotation.prop file, change the value next to InterProScanImportParameters.inputFormat):

/usr/local/blast2go/blast2go_cli.run -properties annotation.prop -useobo go.obo -loadblast blastresults.xml -loadips50 ipsout.xml -mapping -annotation  -annex -statistics all -saveb2g myresult -saveannot myresult -savereport myresult -tempfolder ./ >& annotatelogfile &

Very important file format problem

The output file from BLAST2GO myresult.annot is formatted for Windows PC. It will cause problems if you use it on Linux computer. Make sure to convert the file to LINUX line ends for this file. The command is:

dos2unix myresult.annot

Transcriptome Quality Assessment

BUSCO

#!/bin/bash

set -e

cd /NABIC/HOME/senthil83/analysis/009_Assembly_quality_metrics_s_alatum

ln -fs /NABIC/HOME/senthil83/analysis/004_denovo_assembly_s_alatum_rna/s_alatum_trinity_out/Trinity.fasta .


echo "Activate the busco environment"

source activate busco_env

echo "BUSCO analysis for S_alatum" 

busco \
  -i Trinity.fasta \
  -o busco_S_alatum_embryophyta \
  -l  /NABIC/HOME/senthil83/analysis/009_Assembly_quality_metrics_s_alatum/eudicots_odb10.2020-09-10/eudicots_odb10 \
  -m transcriptome 


#deactivate busco environment
source deactivate busco_env


bash run.sh &> log &

#exit

References

Transcriptome Assembly Quality Assessment

Trinotate

Prototype Ref1 Githubref2

A Tissue-Mapped Axolotl De Novo TranscriptomeEnables Identification of Limb Regeneration Factors

Diamond

DIAMOND

##Install DIAMOND BLASTX tool https://anaconda.org/bioconda/diamond

##create specific environment for diamond conda create --name diamond_env

source activate diamond_env

conda install -c bioconda diamond conda install -c bioconda/label/cf201901 diamond

https://github.com/bbuchfink/diamond

downloading the tool

wget http://github.com/bbuchfink/diamond/releases/download/v2.0.8/diamond-linux64.tar.gz tar xzf diamond-linux64.tar.gz

creating a diamond-formatted database file

./diamond makedb --in reference.fasta -d reference

running a search in blastp mode

./diamond blastp -d reference -q queries.fasta -o matches.tsv

running a search in blastx mode

./diamond blastx -d reference -q reads.fasta -o matches.tsv

downloading and using a BLAST database

update_blastdb --decompress --blastdb_version 5 swissprot ./diamond blastp -d swissprot -q queries.fasta -o matches.tsv

##Making database###

/usr/bin/time -o output.txt -v diamond makedb -p 32 --in uniprot_sprot.fasta -d uniprot_sprot &>log &

/usr/bin/time -o outputswissprot.txt -v diamond makedb -p 32 --in swissprot.fasta -d swissprot &>log.swissprt &

/usr/bin/time -o outputnr.txt -v diamond makedb -p 32 --in nr.fasta -d nr &>log.nrdb &

##Run Blastx##

/usr/bin/time -o outputblastx.txt -v diamond blastx -p 32 -d uniprot_sprot -q alatum.transcripts.fasta -o alatum_matches.xml --outfmt 5 &> log.blastx.xml &

/usr/bin/time -o output_sprotblastx.txt -v diamond blastx -p 32 -d swissprot -q alatum.transcripts.fasta -o swissprot.alatum_matches.xml --outfmt 5 &> log.swisprot.blastx.xml &

/usr/bin/time -o output_nrblastx.txt -v diamond blastx -p 32 -sensitive -d nr -q alatum.transcripts.fasta -o nr.alatum_matches.xml --outfmt 5 &> log.nr.blastx.xml &

CDD DATABASE

NUCELOTIDE QUERY REF Biostar_help Biopython

GO ENRICHMENT using R

GOseq_R

AUGUSTUS GENE PREDICTION TOOL

GENOMICS TUTORIAL

/data/www/augustus/augustus/bin/augustus --species=arabidopsis --strand=both --singlestrand=false --genemodel=partial --codingseq=on --sample=100 --keep_viterbi=true --alternatives-from-sampling=true --minexonintronprob=0.2 --minmeanexonintronprob=0.5 --maxtracks=2 /data/www/augustus/webservice/data/AUG-1481299735/input.fa --exonnames=on

Using Sequence_IDS pooled from trasncript_dataset

$ seqtk subseq your.input.fasta the_header_of_interest_IDs.list > your_output.fasta

Sequence Length distribution

$ cat file.fa | awk '$0 ~ ">" {print c; c=0;printf substr($0,2,100) "\t"; } $0 !~ ">" {c+=length($0);} END { print c; }'

How to generate sequence length distribution from Fasta file

PFAM

For PFAM annotations, I used hmmscan against the full database downloaded from the xfam page. To download the databank and create the hmmscan file:

wget ftp://ftp.ebi.ac.uk/pub/databases/Pfam/current_release/Pfam-A.hmm.gz #download

$ gunzip Pfam-A.hmm.gz #unzip the database

$ hmmpress Pfam-A.hmm #parse the database

And to run the scan against the query sequences:

$ hmmscan --tblout MySu_v1.PFAM.txt Pfam-A.hmm MySu01_v1.proteins.fasta

pfam_scan_env

#download

wget http://ftp.ebi.ac.uk/pub/databases/Pfam/current_release/Pfam-A.hmm.dat.gz for Pfam-B.hmm.dat wget http://ftp.ebi.ac.uk/pub/databases/Pfam/releases/Pfam33.1/active_site.dat.gz for active_site.dat

##after download

hmmpress Pfam-A.hmm hmmpress Pfam-B.hmm

pfam_scan.pl -fasta ourproteinsequence.fasta -cpu 32 -outfile outputfilename.txt - as -dir thewholepathofthedirectorofdb

pfam_scan.pl -fasta suwon.blastpep.fasta -cpu 32 -outfile suwon.hmmresults.txt - as -dir /NABIC/HOME/senthil10/datafiles/05.pfam_db

InterproScan

Download

To install and run InterProScan

To Run

./interproscan.sh -appl CDD,COILS,Gene3D,HAMAP,MobiDBLite,PANTHER,Pfam,PIRSF,PRINTS,PROSITEPATTERNS,PROSITEPROFILES,SFLD,SMART,SUPERFAMILY,TIGRFAM -i /path/to/sequences.fasta

Gene Enrichment Analysis

Gene Set Enrichment Analysis with ClusterProfiler

RNA-seq analysis in R

COG classification_Draw by R


library(ggplot2)


dat <- data.frame(
  FunctionClass = factor(c("A", "B", "C", "D", "E", "F", "G", "H", "I",     "J", "K", "L", "M", "N", "O", "P", "Q", "R", "S", "T", "U", "V", "W", "Y", "Z"), levels=c("A", "B", "C", "D", "E", "F", "G", "H", "I", "J", "K", "L", "M", "N", "O", "P", "Q", "R", "S", "T", "U", "V", "W", "Y", "Z")),
  legend = c("A: RNA processing and modification", "B: Chromatin structure and dynamics", "C: Energy production and conversion", "D: Cell cycle control, cell division, chromosome partitioning", "E: Amino acid transport and metabolism", "F: Nucleotide transport and metabolism", "G: Carbohydrate transport and metabolism", "H: Coenzyme transport and metabolism", "I: Lipid transport and metabolism", "J: Translation, ribosomal structure and biogenesis", "K: Transcription", "L: Replication, recombination and repair", "M: Cell wall/membrane/envelope biogenesis", "N: Cell motility", "O: Posttranslational modification, protein turnover, chaperones", "P: Inorganic ion transport and metabolism", "Q: Secondary metabolites biosynthesis, transport and catabolism", "R: General function prediction only", "S: Function unknown", "T: Signal transduction mechanisms", "U: Intracellular trafficking, secretion, and vesicular transport", "V: Defense mechanisms", "W: Extracellular structures", "Y: Nuclear structure", "Z: Cytoskeleton"),
  Frequency=c(1516,232,1011,257,934,341,1689,393,1019,1395,2777,862,354,1,2548,849,960,0,8789,3423,1226,258,25,15,476)
)


p <- ggplot(data=dat, aes(x=FunctionClass, y=Frequency, fill=legend))+
  
  geom_bar(stat="identity", position=position_dodge(), colour="seashell")

p + guides (fill = guide_legend(ncol = 1))

p <- ggplot(data=dat, aes(x=FunctionClass, y=No.of.Transcripts, fill=legend))+
  geom_bar(stat="identity", position=position_dodge(), colour="seashell")
p + guides (fill = guide_legend(ncol = 1))+
  xlab("Factor Class")+
  ggtitle("Cluster of Orthologous")

SSR marker prediction_KRAIT

KRAIT

[Krait: an ultrafast tool for genome-wide survey of microsatellites and primer design] (https://academic.oup.com/bioinformatics/article/34/4/681/4557187)

Du, L., Zhang, C., Liu, Q., Zhang, X. & Yue, B. Krait: an ultrafast tool for genome-wide survey of microsatellites and primer design. Bioinformatics 34, 681–683 (2018).

Distribution patterns and variation analysis of simple sequence repeats in different genomic regions of bovid genomes Download PDF

Genome-wide investigation of microsatellite polymorphism in coding region of the giant panda (Ailuropoda melanoleuca) genome: a resource for study of phenotype diversity and abnormal traits