seninfobio/Transcriptomics

Transcriptomics

Quality control

Reference site

Create a conda environment named qc and install fastp, FastQC, and MultiQC

  • Step:1
  • Tool installation
# create env and install tools
$ conda create --yes -n qc fastp fastqc multiqc

# activate env
$ conda activate qc
  • Step: 2
  • Write the QC script

----------------------------

vi run_qc.sh


#!/bin/bash

set -e


#--Activate the qc environment

source activate qc


#-- Check the reads quality

cd ~/analysis/015_seed_transcriptomics/01.quality_preprocessing/raw

fastqc -t 32 -o ~/analysis/015_seed_transcriptomics/01.quality_preprocessing/raw *.fastq.gz &> log.fastqc

#---Trimming

mkdir -p ../trim

for reads in *_1.fastq.gz

do

base=$(basename "$reads" _1.fastq.gz)

fastp --detect_adapter_for_pe \
       --overrepresentation_analysis \
       --correction --cut_right --thread 2 \
       --html "../trim/${base}.fastp.html" --json "../trim/${base}.fastp.json" \
       -i "${base}_1.fastq.gz" -I "${base}_2.fastq.gz" \
       -o "../trim/${base}_R1.fastq.gz" -O "../trim/${base}_R2.fastq.gz"

done &> log.fastp

#---Let's run fastqc again on the trimmed data

cd ../trim/

fastqc -t 32 *.fastq.gz

#--Let's run multiQC on both untrimmed and trimmed files

cd ..

multiqc raw trim 

source deactivate qc

#--Bye


$ bash run_qc.sh &> log.qc &
        

Note:

--detect_adapter_for_pe: Enables adapter auto-detection for paired-end data. Without it, fastp trims paired-end adapters by read-overlap analysis only.

--overrepresentation_analysis: Analyse the sequence collection for sequences that appear too often.

--correction: Will try to correct bases based on an overlap analysis of read1 and read2.

--cut_right: Quality trimming with a sliding window that moves from the start of the read to the end. If the mean quality in a window falls below the threshold, that window and all sequence after it are discarded; the read is kept if it is still long enough.

--thread: Specify how many concurrent threads the process can use.

--html and --json: Paths of the HTML and JSON report files.

-i ${base}_1.fastq.gz -I ${base}_2.fastq.gz: The two input read files (read 1 and read 2).

-o ../trim/${base}_R1.fastq.gz -O ../trim/${base}_R2.fastq.gz: The two trimmed output read files.
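Because fastp takes read 1 and read 2 as separate files (-i and -I), it can help to confirm that every `_1.fastq.gz` file has a mate before starting the loop. The following is a minimal sketch, not part of the original pipeline; `check_pairs` is a hypothetical helper name, and the `_1.fastq.gz` / `_2.fastq.gz` suffixes are taken from the trimming loop.

```shell
# Sketch: report samples whose read-1 file has no matching read-2 file.
# check_pairs is a hypothetical helper; suffixes follow the trimming loop.
check_pairs() {
    dir="${1:-.}"
    missing=0
    for r1 in "$dir"/*_1.fastq.gz; do
        [ -e "$r1" ] || continue              # glob matched nothing
        base="${r1%_1.fastq.gz}"              # strip the read-1 suffix
        if [ ! -e "${base}_2.fastq.gz" ]; then
            echo "missing mate for $(basename "$base")"
            missing=1
        fi
    done
    return "$missing"                          # non-zero if any mate is missing
}
```

Calling `check_pairs raw || exit 1` before the fastp loop would stop the run early instead of failing halfway through.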

Blast2GO annotation

Use this command if you have InterProScan results (edit the annotation.prop file first and change the value of InterProScanImportParameters.inputFormat):

/usr/local/blast2go/blast2go_cli.run -properties annotation.prop -useobo go.obo -loadblast blastresults.xml -loadips50 ipsout.xml -mapping -annotation  -annex -statistics all -saveb2g myresult -saveannot myresult -savereport myresult -tempfolder ./ >& annotatelogfile &

Very important file format problem

The Blast2GO output file myresult.annot has Windows (CRLF) line endings, which cause problems on a Linux computer. Convert the file to Unix line endings before using it:

dos2unix myresult.annot
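To check whether a file still has Windows line endings, or to convert it where dos2unix is not installed, something like the following could be used. This is a sketch under assumptions: `has_crlf` and `strip_cr` are hypothetical helper names, and `strip_cr` removes every carriage return in the file, not only those at line ends.

```shell
# Sketch: detect and remove carriage returns (Windows line endings).
# has_crlf and strip_cr are hypothetical helper names.
cr=$(printf '\r')

has_crlf() {
    # Succeeds if at least one line ends with a carriage return.
    grep -q "${cr}\$" "$1"
}

strip_cr() {
    # Writes the file to stdout with all carriage returns removed.
    tr -d '\r' < "$1"
}
```

For example, `has_crlf myresult.annot && strip_cr myresult.annot > myresult.unix.annot` writes a converted copy and leaves the original untouched.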

Transcriptome Quality Assessment

#!/bin/bash

set -e

cd /NABIC/HOME/senthil83/analysis/009_Assembly_quality_metrics_s_alatum

ln -fs /NABIC/HOME/senthil83/analysis/004_denovo_assembly_s_alatum_rna/s_alatum_trinity_out/Trinity.fasta .


echo "Activate the busco environment"

source activate busco_env

echo "BUSCO analysis for S_alatum" 

busco \
  -i Trinity.fasta \
  -o busco_S_alatum_embryophyta \
  -l  /NABIC/HOME/senthil83/analysis/009_Assembly_quality_metrics_s_alatum/eudicots_odb10.2020-09-10/eudicots_odb10 \
  -m transcriptome 


#deactivate busco environment
source deactivate busco_env


bash run.sh &> log &

#exit
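Once BUSCO finishes, the one-line completeness score can be pulled out of its short summary file. The sketch below assumes the summary contains a line of the form `C:..%[S:..%,D:..%],F:..%,M:..%,n:...` and that the file is named something like `short_summary*.txt` inside the output folder; both can differ between BUSCO versions, and `busco_score` is a hypothetical helper name.

```shell
# Sketch: print the one-line BUSCO score from a short summary file.
# busco_score is a hypothetical helper; the line format is an assumption.
busco_score() {
    # Take the first line containing "C:<number>%" and drop the
    # surrounding whitespace.
    grep -E 'C:[0-9.]+%' "$1" | head -n 1 | awk '{ print $1 }'
}
```

Usage would look like `busco_score busco_S_alatum_embryophyta/short_summary*.txt`, assuming the glob matches a single file.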

  • References

Transcriptome Assembly Quality Assessment

Trinotate

Prototype Ref1 Githubref2

A Tissue-Mapped Axolotl De Novo Transcriptome Enables Identification of Limb Regeneration Factors

Diamond

DIAMOND

##Install DIAMOND BLASTX tool https://anaconda.org/bioconda/diamond

##create specific environment for diamond

conda create --name diamond_env

source activate diamond_env

conda install -c bioconda diamond

##or, from the cf201901 label

conda install -c bioconda/label/cf201901 diamond

https://github.com/bbuchfink/diamond

downloading the tool

wget http://github.com/bbuchfink/diamond/releases/download/v2.0.8/diamond-linux64.tar.gz

tar xzf diamond-linux64.tar.gz

creating a diamond-formatted database file

./diamond makedb --in reference.fasta -d reference

running a search in blastp mode

./diamond blastp -d reference -q queries.fasta -o matches.tsv

running a search in blastx mode

./diamond blastx -d reference -q reads.fasta -o matches.tsv

downloading and using a BLAST database

update_blastdb --decompress --blastdb_version 5 swissprot

./diamond blastp -d swissprot -q queries.fasta -o matches.tsv

##Making database###

/usr/bin/time -o output.txt -v diamond makedb -p 32 --in uniprot_sprot.fasta -d uniprot_sprot &>log &

/usr/bin/time -o outputswissprot.txt -v diamond makedb -p 32 --in swissprot.fasta -d swissprot &>log.swissprt &

/usr/bin/time -o outputnr.txt -v diamond makedb -p 32 --in nr.fasta -d nr &>log.nrdb &

##Run Blastx##

/usr/bin/time -o outputblastx.txt -v diamond blastx -p 32 -d uniprot_sprot -q alatum.transcripts.fasta -o alatum_matches.xml --outfmt 5 &> log.blastx.xml &

/usr/bin/time -o output_sprotblastx.txt -v diamond blastx -p 32 -d swissprot -q alatum.transcripts.fasta -o swissprot.alatum_matches.xml --outfmt 5 &> log.swisprot.blastx.xml &

/usr/bin/time -o output_nrblastx.txt -v diamond blastx -p 32 --sensitive -d nr -q alatum.transcripts.fasta -o nr.alatum_matches.xml --outfmt 5 &> log.nr.blastx.xml &
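The commands above write BLAST XML (--outfmt 5). If the search is instead run with the tabular format (--outfmt 6), a best hit per transcript can be picked with standard tools. This is a sketch under that assumption: it relies on the default 12-column layout, where column 1 is the query ID and column 12 is the bit score, and `best_hits` is a hypothetical helper name.

```shell
# Sketch: keep the highest-bit-score hit per query from a tabular
# (--outfmt 6) DIAMOND result. best_hits is a hypothetical helper.
best_hits() {
    tab=$(printf '\t')
    # Sort by query ID, then by bit score (column 12) in descending order,
    # and keep only the first line seen for each query.
    sort -t "$tab" -k1,1 -k12,12gr "$1" | awk -F'\t' '!seen[$1]++'
}
```

For example, `best_hits alatum_matches.tsv > alatum_best_hits.tsv`, where alatum_matches.tsv is an assumed tabular output name.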

CDD DATABASE

NUCLEOTIDE QUERY REF Biostar_help Biopython

GO ENRICHMENT using R

GOseq_R

AUGUSTUS GENE PREDICTION TOOL

GENOMICS TUTORIAL

/data/www/augustus/augustus/bin/augustus --species=arabidopsis --strand=both --singlestrand=false --genemodel=partial --codingseq=on --sample=100 --keep_viterbi=true --alternatives-from-sampling=true --minexonintronprob=0.2 --minmeanexonintronprob=0.5 --maxtracks=2 /data/www/augustus/webservice/data/AUG-1481299735/input.fa --exonnames=on

Extracting sequences by ID from a transcript dataset

$ seqtk subseq your.input.fasta the_header_of_interest_IDs.list > your_output.fasta
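seqtk subseq expects a plain list of sequence names, one per line. One way to build such a list from the FASTA headers is sketched below. It assumes the sequence ID is the first whitespace-delimited word of each header; `make_id_list` is a hypothetical helper name.

```shell
# Sketch: build an ID list for seqtk subseq from FASTA headers that match
# a pattern. make_id_list is a hypothetical helper name.
make_id_list() {
    # $1 = extended regular expression, $2 = FASTA file
    grep '^>' "$2" | grep -E "$1" | sed -e 's/^>//' -e 's/[[:space:]].*$//'
}
```

For example, `make_id_list 'CYP' your.input.fasta > the_header_of_interest_IDs.list` collects the IDs of all headers that mention CYP.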

Sequence Length distribution

$ awk '/^>/ {if (name != "") print name "\t" len; name=substr($0,2,100); len=0; next} {len+=length($0)} END {if (name != "") print name "\t" len}' file.fa

How to generate sequence length distribution from Fasta file
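The per-sequence lengths can be grouped into bins to get the distribution itself. The following is a minimal sketch; `fasta_lengths` and `length_histogram` are hypothetical helper names, and the default bin width of 500 bp is an arbitrary choice.

```shell
# Sketch: sequence lengths and a binned length distribution from a FASTA file.
# fasta_lengths and length_histogram are hypothetical helper names.
fasta_lengths() {
    # Prints header<TAB>length for each sequence (multi-line records are summed).
    awk '/^>/ { if (name != "") print name "\t" len
                name = substr($0, 2); len = 0; next }
         { len += length($0) }
         END { if (name != "") print name "\t" len }' "$1"
}

length_histogram() {
    # $1 = FASTA file, $2 = bin width in bp (default 500).
    # Prints binstart-binend<TAB>count, sorted by bin start.
    fasta_lengths "$1" | awk -F'\t' -v bin="${2:-500}" '
        { b = int($NF / bin) * bin; count[b]++ }
        END { for (b in count) print b "-" (b + bin - 1) "\t" count[b] }' | sort -n
}
```

For example, `length_histogram file.fa 200` counts sequences in 200 bp bins; the output can be plotted directly in R or a spreadsheet.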

PFAM

For Pfam annotations, I used hmmscan against the full Pfam-A database downloaded from the Xfam page. To download the database and build the index files that hmmscan needs:

$ wget ftp://ftp.ebi.ac.uk/pub/databases/Pfam/current_release/Pfam-A.hmm.gz #download

$ gunzip Pfam-A.hmm.gz #unzip the database

$ hmmpress Pfam-A.hmm #compress and index the database for hmmscan

And to run the scan against the query sequences:

$ hmmscan --tblout MySu_v1.PFAM.txt Pfam-A.hmm MySu01_v1.proteins.fasta
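The --tblout file is whitespace-delimited, with comment lines starting with `#`; the first column is the Pfam family name and the third is the query sequence name. A quick per-protein hit count can be sketched as follows. `pfam_hits_per_query` is a hypothetical helper name, and no E-value filtering is applied here.

```shell
# Sketch: count Pfam hits per query protein in an hmmscan --tblout file.
# pfam_hits_per_query is a hypothetical helper name.
pfam_hits_per_query() {
    # Skip comment lines; column 3 of each data line is the query name.
    awk '!/^#/ && NF >= 3 { n[$3]++ }
         END { for (q in n) print q "\t" n[q] }' "$1" | sort
}
```

For example, `pfam_hits_per_query MySu_v1.PFAM.txt` prints one line per protein with its number of Pfam hits.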

pfam_scan_env

#download

wget http://ftp.ebi.ac.uk/pub/databases/Pfam/current_release/Pfam-A.hmm.dat.gz #for Pfam-A.hmm.dat

wget http://ftp.ebi.ac.uk/pub/databases/Pfam/releases/Pfam33.1/active_site.dat.gz #for active_site.dat

##after download

hmmpress Pfam-A.hmm

hmmpress Pfam-B.hmm #only if a Pfam-B.hmm file is present in the directory

pfam_scan.pl -fasta ourproteinsequence.fasta -cpu 32 -outfile outputfilename.txt -as -dir /full/path/to/the/db/directory

pfam_scan.pl -fasta suwon.blastpep.fasta -cpu 32 -outfile suwon.hmmresults.txt -as -dir /NABIC/HOME/senthil10/datafiles/05.pfam_db

InterproScan

Download

To install and run InterProScan

To Run

./interproscan.sh -appl CDD,COILS,Gene3D,HAMAP,MobiDBLite,PANTHER,Pfam,PIRSF,PRINTS,PROSITEPATTERNS,PROSITEPROFILES,SFLD,SMART,SUPERFAMILY,TIGRFAM -i /path/to/sequences.fasta

Gene Enrichment Analysis

Gene Set Enrichment Analysis with ClusterProfiler

RNA-seq analysis in R

COG classification_Draw by R


library(ggplot2)


dat <- data.frame(
  FunctionClass = factor(c("A", "B", "C", "D", "E", "F", "G", "H", "I",     "J", "K", "L", "M", "N", "O", "P", "Q", "R", "S", "T", "U", "V", "W", "Y", "Z"), levels=c("A", "B", "C", "D", "E", "F", "G", "H", "I", "J", "K", "L", "M", "N", "O", "P", "Q", "R", "S", "T", "U", "V", "W", "Y", "Z")),
  legend = c("A: RNA processing and modification", "B: Chromatin structure and dynamics", "C: Energy production and conversion", "D: Cell cycle control, cell division, chromosome partitioning", "E: Amino acid transport and metabolism", "F: Nucleotide transport and metabolism", "G: Carbohydrate transport and metabolism", "H: Coenzyme transport and metabolism", "I: Lipid transport and metabolism", "J: Translation, ribosomal structure and biogenesis", "K: Transcription", "L: Replication, recombination and repair", "M: Cell wall/membrane/envelope biogenesis", "N: Cell motility", "O: Posttranslational modification, protein turnover, chaperones", "P: Inorganic ion transport and metabolism", "Q: Secondary metabolites biosynthesis, transport and catabolism", "R: General function prediction only", "S: Function unknown", "T: Signal transduction mechanisms", "U: Intracellular trafficking, secretion, and vesicular transport", "V: Defense mechanisms", "W: Extracellular structures", "Y: Nuclear structure", "Z: Cytoskeleton"),
  Frequency=c(1516,232,1011,257,934,341,1689,393,1019,1395,2777,862,354,1,2548,849,960,0,8789,3423,1226,258,25,15,476)
)


p <- ggplot(data=dat, aes(x=FunctionClass, y=Frequency, fill=legend))+
  
  geom_bar(stat="identity", position=position_dodge(), colour="seashell")

p + guides (fill = guide_legend(ncol = 1))

p <- ggplot(data=dat, aes(x=FunctionClass, y=Frequency, fill=legend))+
  geom_bar(stat="identity", position=position_dodge(), colour="seashell")
p + guides (fill = guide_legend(ncol = 1))+
  xlab("Function Class")+
  ylab("No. of Transcripts")+
  ggtitle("Clusters of Orthologous Groups (COG) classification")

SSR marker prediction_KRAIT

KRAIT

[Krait: an ultrafast tool for genome-wide survey of microsatellites and primer design](https://academic.oup.com/bioinformatics/article/34/4/681/4557187)

Du, L., Zhang, C., Liu, Q., Zhang, X. & Yue, B. Krait: an ultrafast tool for genome-wide survey of microsatellites and primer design. Bioinformatics 34, 681–683 (2018).

Distribution patterns and variation analysis of simple sequence repeats in different genomic regions of bovid genomes

Genome-wide investigation of microsatellite polymorphism in coding region of the giant panda (Ailuropoda melanoleuca) genome: a resource for study of phenotype diversity and abnormal traits

Reference paper

A Literature Review of Gene Function Prediction by Modeling Gene Ontology

Dynamic metabolic and transcriptomic profiling of methyl jasmonate-treated hairy roots reveals synthetic characters and regulators of lignan biosynthesis in Isatis indigotica Fort

Tandem UGT71B5s Catalyze Lignan Glycosylation in Isatis indigotica With Substrates Promiscuity

De Novo Transcriptomes of Forsythia koreana Using a Novel Assembly Method: Insight into Tissue- and Species-Specific Expression of Lignan Biosynthesis-Related Gene

Transcriptomic comparison reveals genetic variation potentially underlying seed developmental evolution of soybeans

Comparative transcriptomes and development of expressed sequence tag-simple sequence repeat markers for two closely related oak species

Methodology references---Identification of Glutathione S-Transferase Genes in Hami Melon (Cucumis melo var. saccharinus) and Their Expression Analysis Under Cold Stress

EnTAP: Bringing Faster and Smarter Functional Annotation to Non-Model Eukaryotic Transcriptomes

A Comparison of Resources for the Annotation of a De Novo Assembled Transcriptome in the Molting Gland (Y-Organ) of the Blackback Land Crab, Gecarcinus lateralis

Comparative transcriptome analysis of cultivated and wild seeds of Salvia hispanica (chia)

Seed Transcriptomics Analysis in Camellia oleifera Uncovers Genes Associated with Oil Content and Fatty Acid Composition

Formation of two methylenedioxy bridges by a Sesamum CYP81Q protein yielding a furofuran lignan, (+)-sesamin

De novo Assembly of Leaf Transcriptome in the Medicinal Plant Andrographis paniculata

Transcriptome Dynamics during Black and White Sesame (Sesamum indicum L.) Seed Development and Identification of Candidate Genes Associated with Black Pigmentation

De novo transcriptome assembly, annotation and comparison of four ecological and evolutionary model salmonid fish species

Data of de novo assembly and functional annotation of the leaf transcriptome of Impatiens balsamina

Transcriptome Analysis by RNA–Seq Reveals Genes Related to Plant Height in Two Sets of Parent-hybrid Combinations in Easter lily (Lilium longiflorum)

Transcriptome analysis of pecan seeds at different developing stages and identification of key genes involved in lipid metabolism

Transcriptome Analysis Reveals Key Seed-Development Genes in Common Buckwheat (Fagopyrum esculentum)

Transcriptome analysis of metabolic pathways associated with oil accumulation in developing seed kernels of Styrax tonkinensis, a woody biodiesel species

Genomic and Transcriptomic Analysis Identified Gene Clusters and Candidate Genes for Oil Content in Peanut (Arachis hypogaea L.)

Comparative transcriptome analysis of the genes involved in lipid biosynthesis pathway and regulation of oil body formation in Torreya grandis kernels

Transcriptome Analysis of Acer truncatum Seeds Reveals Candidate Genes Related to Oil Biosynthesis and Fatty Acid Metabolism

RNA-seq Transcriptome Analysis of Panax japonicus, and Its Comparison with Other Panax Species to Identify Potential Genes Involved in the Saponins Biosynthesis

De Novo Assembly and Annotation of the Chinese Chive (Allium tuberosum Rottler ex Spr.) Transcriptome Using the Illumina Platform

Seed Transcriptome Annotation Reveals Enhanced Expression of Genes Related to ROS Homeostasis and Ethylene Metabolism at Alternating Temperatures in Wild Cardoon

Gene expression profiles that shape high and low oil content sesames

Transcriptome profiling of spike provides expression features of genes related to terpene biosynthesis in lavender

Metabolic engineering of fatty acid biosynthetic pathway in sesame (Sesamum indicum L.): assembling tools to develop nutritionally desirable sesame seed oil

## unreplicated data_Comparative Characterization of the Leaf Tissue of Physalis alkekengi and Physalis peruviana Using RNA-seq and Metabolite Profiling

Full-Length Transcriptome Sequencing and Comparative Transcriptome Analysis to Evaluate Drought and Salt Stress in Iris lactea var. chinensis

Lamiales_ref_Functional characterization of the cytochrome P450 monooxygenase CYP71AU87 indicates a role in marrubiin biosynthesis in the medicinal plant Marrubium vulgare

[Genome sequencing of the important oilseed crop Sesamum indicum L.](https://genomebiology.biomedcentral.com/articles/10.1186/gb-2013-14-1-401#ref-CR37)

Phylogenomic analysis of cytochrome P450 multigene family and their differential expression analysis in Solanum lycopersicum L. suggested tissue specific promoters

Computational Identification and Systematic Classification of Novel Cytochrome P450 Genes in Salvia miltiorrhiza

Genomic and transcriptomic insights into cytochrome P450 monooxygenase genes involved in nicosulfuron tolerance in maize (Zea mays L.)

Sesamin/Sesamolin/Lignan

(+)‐Sesamin‐oxidising CYP92B14 shapes specialised lignan metabolism in sesame

Glycoside‐specific glycosyltransferases catalyze regio‐selective sequential glucosylations for a sesame lignan, sesaminol triglucoside

Formation of a Methylenedioxy Bridge in (+)-Epipinoresinol by CYP81Q3 Corroborates with Diastereomeric Specialization in Sesame Lignans

Oxidative rearrangement of (+)-sesamin by CYP92B14 co-generates twin dietary lignans in sesame

Formation of two methylenedioxy bridges by a Sesamum CYP81Q protein yielding a furofuran lignan, (+)-sesamin

Sequential glucosylation of a furofuran lignan, (+)‐sesaminol, by Sesamum indicum UGT71A9 and UGT94D1 glucosyltransferases

Variations in the composition of sterols, tocopherols and lignans in seed oils from four Sesamum species

Lignan analysis in seed oils from four Sesamum species: Comparison of different chromatographic methods

Lignans of Sesame (Sesamum indicum L.): A Comprehensive Review

Candidate genes involved in the biosynthesis of lignan in Schisandra chinensis fruit based on transcriptome and metabolomes analysis

The cytochrome P450 superfamily: Key players in plant development and defense

Identification of a binding protein for sesamin and characterization of its roles in plant growth
