Commit fdf235c

Merge branch 'main' of https://github.com/schosio/CHOLAR
2 parents c44130f + 95e4ff1 commit fdf235c

1 file changed

Lines changed: 0 additions & 50 deletions

README.md

@@ -32,53 +32,3 @@ This will generate
# Summary

RNA sequencing (RNA-seq) has found numerous applications in research, from distinguishing immune cell subtypes
to profiling differential gene expression between cancer and normal tissue [@Villani2017][@Bao2021].
Another application of RNA-seq is the identification of novel transcripts involved in various biological processes
(Gupta, Kleinjans and Caiment 2021). Among the most relevant are context- and cell-type-specific non-coding RNAs,
such as long non-coding RNAs (lncRNAs), which have become a focal point of transcriptomic studies demonstrating
their roles in regulating gene expression, post-transcriptional regulation, and epigenetic regulation
(Engreitz et al. 2016, Zhu et al. 2019).
It is therefore becoming crucial to check the relative expression of lncRNAs in transcriptome-wide studies. Our group
has developed an automated lncRNA expression pipeline. The only input required from the user is raw data in
FASTQ format (paired-end or single-end). The user receives a list of known and novel lncRNAs, together with
differential gene expression between conditions. The pipeline comes with a user-friendly GUI, so the user does not
need to be versed in complex transcriptome analysis or the UNIX environment. The source code is available
under an open-source licence at https://github.com/schosio/CHOLAR.


# Statement of need

RNA-seq datasets support a wide range of inferences. The software used in a typical RNA-seq analysis
pipeline runs in a UNIX-based command-line interface (CLI) environment, with each tool executed in succession.
Installing multiple CLI tools, handling various file formats, and plotting graphs require an understanding of Linux and
the R programming language. Moreover, to the best of our knowledge, no tool identifies novel lncRNAs directly from raw
transcriptomic data. LncRNA identification tools such as `CPAT` (Wang et al. 2013) take either transcript
sequences (FASTA) or transcript coordinates (BED, GTF) as input and provide a list of predicted lncRNAs.
To address these issues, we developed `CHOLAR`, a tool for the characterization of lncRNAs from raw reads.
`CHOLAR` i) identifies novel lncRNAs from raw reads, ii) provides a user-friendly GUI to make changes
at every step, iii) identifies differentially expressed genes and lists known and novel lncRNAs, and iv) generates
publication-quality plots such as MA plots, volcano plots, and heatmaps.


# Implementation

The `CHOLAR` pipeline is implemented in bash and R. It first reads the input FASTQ file(s) and checks the
quality of the reads using `FastQC` (Fiancette et al. 2021). Low-quality reads (quality < 28) and adaptors are removed
using `Trimmomatic` (Bolger, Lohse and Usadel 2014). `HISAT2` maps the reads to the human reference genome
(hg38) (Zhang et al. 2021). The SAM files generated by `HISAT2` are converted to BAM, and PCR duplicates are removed
using the `samtools` toolkit (Danecek et al. 2021).
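
The read-processing steps above can be sketched as an ordered list of command lines. This is a minimal illustration for paired-end data, not the commands `CHOLAR` itself runs: the file names, the `hg38_index` and `adapters.fa` paths, the `trimmomatic` wrapper name, and the choice of `SLIDINGWINDOW` to apply the quality threshold of 28 are all assumptions made for the example. The sketch only builds the commands and does not execute them.

```python
import shlex


def preprocessing_commands(sample, r1, r2, index="hg38_index",
                           adapters="adapters.fa", min_quality=28):
    """Build (but do not run) QC, trimming, alignment and de-duplication commands.

    All paths and output names are hypothetical placeholders.
    """
    # Trimmomatic PE writes paired and unpaired outputs for each mate.
    r1_paired = f"{sample}_R1.paired.fastq.gz"
    r1_unpaired = f"{sample}_R1.unpaired.fastq.gz"
    r2_paired = f"{sample}_R2.paired.fastq.gz"
    r2_unpaired = f"{sample}_R2.unpaired.fastq.gz"

    return [
        # 1. Quality report for the raw reads (the "qc" directory must exist).
        ["fastqc", r1, r2, "-o", "qc"],
        # 2. Adapter clipping and quality trimming (threshold usage is an assumption).
        ["trimmomatic", "PE", r1, r2,
         r1_paired, r1_unpaired, r2_paired, r2_unpaired,
         f"ILLUMINACLIP:{adapters}:2:30:10",
         f"SLIDINGWINDOW:4:{min_quality}",
         "MINLEN:36"],
        # 3. Spliced alignment against a prebuilt HISAT2 index.
        ["hisat2", "-x", index, "-1", r1_paired, "-2", r2_paired,
         "-S", f"{sample}.sam"],
        # 4. SAM -> BAM, then duplicate removal with the samtools markdup workflow.
        ["samtools", "sort", "-n", "-o", f"{sample}.namesorted.bam", f"{sample}.sam"],
        ["samtools", "fixmate", "-m", f"{sample}.namesorted.bam", f"{sample}.fixmate.bam"],
        ["samtools", "sort", "-o", f"{sample}.sorted.bam", f"{sample}.fixmate.bam"],
        ["samtools", "markdup", "-r", f"{sample}.sorted.bam", f"{sample}.dedup.bam"],
    ]


if __name__ == "__main__":
    for command in preprocessing_commands("sampleA", "sampleA_R1.fastq.gz",
                                          "sampleA_R2.fastq.gz"):
        print(shlex.join(command))
```

Printing the commands rather than running them makes it easy to inspect the order of the steps; each list could be passed to `subprocess.run(command, check=True)` once the tools are installed.
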

Transcripts are assembled using `StringTie`, and the resulting GTF files are merged using the merge utility of
`StringTie` (Pertea et al. 2015). The merged GTF file is compared against the reference annotation file from GENCODE
(Frankish et al. 2021) using `GffCompare` to filter novel transcripts. The coding potential of the novel transcripts is
predicted using `CPAT` (Wang et al. 2013). Gene counts are calculated using `HTSeq` (Putri et al. 2022), and subsequent
differential gene expression analysis (DGEA) is performed using packages from the R statistical language.
The `DESeq2` package is used for DGEA (Love, Huber and Anders 2014), and the `ggplot` and `dplyr` libraries are used for
plotting graphs. The graphical user interface is built using zenity in bash. A schematic of the tool is given in figure 1.
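
The selection of novel lncRNA candidates described above can be illustrated with a short filter. This is a sketch under stated assumptions rather than the logic `CHOLAR` implements: the record fields are hypothetical, the set of `GffCompare` class codes treated as novel (`u`, `i`, `x`), the 200 nt minimum length, and the coding-probability cutoff of 0.364 (the value the `CPAT` documentation suggests for human) are all choices made for the example.

```python
from dataclasses import dataclass

# Assumptions for illustration only; CHOLAR's actual criteria are not stated here.
NOVEL_CLASS_CODES = frozenset({"u", "i", "x"})  # intergenic, intronic, antisense
MIN_LENGTH_NT = 200                             # conventional lncRNA length bound
CODING_PROB_CUTOFF = 0.364                      # CPAT's suggested human cutoff


@dataclass
class Transcript:
    """Hypothetical record combining assembly, comparison and CPAT results."""
    transcript_id: str
    class_code: str      # GffCompare class code against the reference annotation
    length: int          # transcript length in nucleotides
    coding_prob: float   # CPAT coding probability


def novel_lncrna_candidates(transcripts,
                            class_codes=NOVEL_CLASS_CODES,
                            min_length=MIN_LENGTH_NT,
                            cutoff=CODING_PROB_CUTOFF):
    """Return IDs of transcripts that are novel, long enough and predicted non-coding."""
    return [
        t.transcript_id
        for t in transcripts
        if t.class_code in class_codes   # not matching a known annotation
        and t.length > min_length        # longer than the lncRNA length bound
        and t.coding_prob < cutoff       # below the coding-probability cutoff
    ]


if __name__ == "__main__":
    example = [
        Transcript("TCONS_1", "u", 1500, 0.05),  # kept
        Transcript("TCONS_2", "=", 2000, 0.02),  # exact match to a known transcript
        Transcript("TCONS_3", "u", 150, 0.01),   # too short
        Transcript("TCONS_4", "x", 900, 0.80),   # predicted coding
        Transcript("TCONS_5", "i", 450, 0.30),   # kept
    ]
    print(novel_lncrna_candidates(example))  # ['TCONS_1', 'TCONS_5']
```

Keeping the thresholds as parameters reflects that they depend on the organism and on how strictly novelty is defined.
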

# Example

We chose the GSE147761 dataset from the GEO database (NCBI) to showcase the `CHOLAR` tool. A sample of the results and
plots generated by the tool is given in figure 2. The GUI of the tool is shown in figure 3.