DE-kupl is a pipeline that finds differentially expressed k-mers between RNA-Seq datasets.
Dekupl-run handles the first part of the DE-kupl pipeline, from raw FASTQ files to the production of contigs from differentially expressed k-mers.
Before using Dekupl-run, install these dependencies:
- Snakemake
- jellyfish
- pigz
- CMake
- R:
  - DESeq2: open R and execute `source("https://bioconductor.org/biocLite.R")`, then `biocLite("DESeq2")`
  - RColorBrewer
  - pheatmap
- Python:
  - rpy2: `pip3 install rpy2`
- CracTools: `cpanm CracTools::Utils`
- Clone this repository including submodules: `git clone --recursive https://github.com/Transipedia/dekupl-run.git`
- Edit the config.json file to add the list of your samples, their conditions and the location of their FASTQ files. See the next section for a description of the parameters.
- Run the pipeline with the `snakemake -jNB_THREADS --resources ram=MAX_MEMORY -p` command. Replace `NB_THREADS` with the number of threads and `MAX_MEMORY` with the maximum memory (in megabytes) you want DE-kupl to allocate.
- Once Dekupl-run has been fully executed, the DE contigs it produced (under `DEkupl_results/A_vs_B_kmer_counts/merged-diff-counts.tsv.gz`) can be annotated using Dekupl-annotation.
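
The run step above can also be scripted. The following is a minimal sketch, not part of Dekupl-run itself: the helper name `build_snakemake_command` is hypothetical, and it only assembles the flags quoted in the step above (`-j`, `--resources ram=`, `-p`).

```python
import shlex


def build_snakemake_command(nb_threads, max_memory_mb):
    """Assemble the snakemake invocation from the usage steps as an argument list.

    nb_threads    -> value substituted for NB_THREADS
    max_memory_mb -> value substituted for MAX_MEMORY (in megabytes)
    """
    if nb_threads < 1:
        raise ValueError("nb_threads must be at least 1")
    if max_memory_mb < 1:
        raise ValueError("max_memory_mb must be at least 1")
    return [
        "snakemake",
        f"-j{nb_threads}",
        "--resources",
        f"ram={max_memory_mb}",
        "-p",
    ]


if __name__ == "__main__":
    cmd = build_snakemake_command(nb_threads=8, max_memory_mb=16000)
    # Print the command so it can be checked before anything is launched.
    print(shlex.join(cmd))
    # To launch the pipeline from the repository root, one could then call:
    # import subprocess; subprocess.run(cmd, check=True)
```

Returning a list rather than a single string keeps the arguments safe to pass to `subprocess.run` without going through a shell.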
- fastq_dir: Location of FASTQ files
- nb_threads: Default number of threads to use (unless specified on the snakemake command line)
- kmer_length: Length of k-mers (default: 31). This value should not exceed 32.
- diff_method: Method used for differential testing (default: DESeq2). Possible choices are 'Ttest', which is fast, and 'DESeq2', which is more sensitive but takes longer to run.
- lib_type: Paired-end library type (default: `rf`). You can specify either `rf` for reverse-forward strand-specific libraries, `fr` for forward-reverse strand-specific libraries, or `unstranded` for unstranded libraries.
- output_dir: Location of DE-kupl results (default: `DEkupl_result`).
- tmp_dir: Temporary directory to use (default: `./`, i.e. the current directory)
- r1_suffix: Suffix to use for the FASTQ with the left mate. Set `r2_suffix` for the second FASTQ.
- dekupl_counter:
  - min_recurrence: Minimum number of samples to support a k-mer
  - min_recurrence_abundance: Minimum abundance threshold to consider a k-mer in the recurrence filter.
- Ttest:
  - condition: Specify A and B conditions.
  - pvalue_threshold: Adjusted p-value threshold used to consider a k-mer as DE. Only DE k-mers are selected for assembly.
  - log2fc_threshold: Minimum log2 fold change to consider a k-mer as DE.
- Samples: An array of samples. Each sample is described by a `name` and a `condition`. The FASTQ files for a sample are located using the following pattern: `fastq_dir/sample_name_{1,2}.fastq.gz`
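
To make the parameter list more concrete, here is a hedged sketch that assembles a config.json from Python and derives the FASTQ paths that the pattern above implies. Several things are assumptions rather than documented behavior: the helper names, the exact nesting of the keys, the `{"A": ..., "B": ...}` layout of `condition`, the suffix values `_1.fastq.gz` / `_2.fastq.gz`, and every numeric value except the `kmer_length` default. Check them against the config.json shipped with the repository before use.

```python
import json
import os


def build_config(fastq_dir, samples, condition_a, condition_b, kmer_length=31):
    """Build a config dictionary using the parameter names listed above.

    samples is a list of (name, condition) pairs.
    """
    if kmer_length > 32:
        raise ValueError("kmer_length should not exceed 32")
    return {
        "fastq_dir": fastq_dir,
        "nb_threads": 4,                      # illustrative value
        "kmer_length": kmer_length,
        "diff_method": "DESeq2",
        "lib_type": "rf",
        "output_dir": "DEkupl_result",
        "tmp_dir": "./",
        "r1_suffix": "_1.fastq.gz",           # assumed from the FASTQ pattern
        "r2_suffix": "_2.fastq.gz",           # assumed from the FASTQ pattern
        "dekupl_counter": {
            "min_recurrence": 2,              # illustrative value
            "min_recurrence_abundance": 5,    # illustrative value
        },
        "Ttest": {
            "condition": {"A": condition_a, "B": condition_b},  # assumed layout
            "pvalue_threshold": 0.05,         # illustrative value
            "log2fc_threshold": 2,            # illustrative value
        },
        "Samples": [{"name": n, "condition": c} for n, c in samples],
    }


def expected_fastq_paths(config):
    """Map each sample name to the (left mate, right mate) paths implied by the config."""
    paths = {}
    for sample in config["Samples"]:
        base = os.path.join(config["fastq_dir"], sample["name"])
        paths[sample["name"]] = (
            base + config["r1_suffix"],
            base + config["r2_suffix"],
        )
    return paths


if __name__ == "__main__":
    cfg = build_config("data", [("ctrl1", "A"), ("treated1", "B")], "A", "B")
    print(json.dumps(cfg, indent=2))
    for name, (r1, r2) in expected_fastq_paths(cfg).items():
        print(name, r1, r2)
```

Listing the expected paths before launching the pipeline is a cheap way to catch a misspelled sample name or a wrong `fastq_dir`.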
The output directory of a DE-kupl run has the following content:
├── {A}_vs_{B}_kmer_counts
│ ├── diff-counts.tsv.gz
│ ├── merged-diff-counts.tsv.gz
├── gene_expression
│ ├── {A}vs{B}-DEGs.tsv
├── kmer_counts
│ ├── normalization_factors.tsv
│ ├── raw-counts.tsv.gz
│ ├── noGENCODE-counts.tsv.gz
│ ├── {sample}.jf
│ ├── {sample}.txt.gz
│ ├── ...
├── metadata
│ ├── sample_conditions.tsv
│ ├── sample_conditions_full.tsv
The following table describes the output files produced by DE-kupl:

| FileName | Description |
|---|---|
| diff-counts.tsv.gz | Contains k-mer counts from noGENCODE-counts.tsv.gz that have passed the differential testing. Output format is a TSV with the following columns: kmer pvalue meanA meanB log2FC [SAMPLES]. |
| merged-diff-counts.tsv.gz | Contains assembled k-mers from diff-counts.tsv.gz. Output format is a TSV with the following columns: nb_merged_kmers contig kmer pvalue meanA meanB log2FC [SAMPLES]. |
| raw-counts.tsv.gz | Contains raw k-mer counts of all libraries that have been filtered with the recurrence filters. |
| noGENCODE-counts.tsv.gz | Contains k-mer counts from raw-counts.tsv.gz after filtering out the k-mers of the reference transcriptome (GENCODE by default). |
| sample_conditions_full.tsv | Tabulated file with sample names, conditions and normalization factors. sample_conditions.tsv is the sample |
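
As an illustration of the column layout described in the table, the sketch below reads a `merged-diff-counts.tsv.gz` file with the Python standard library and keeps contigs above an absolute log2 fold change cutoff. It assumes that the first line of the file is a header carrying the column names listed above; the helper names are hypothetical, and the demo rows are made up purely to exercise the reader.

```python
import csv
import gzip
import os
import tempfile

# Fixed columns documented for merged-diff-counts.tsv.gz; per-sample columns follow.
FIXED_COLUMNS = ["nb_merged_kmers", "contig", "kmer", "pvalue", "meanA", "meanB", "log2FC"]


def read_contigs(path, min_abs_log2fc=0.0):
    """Yield rows (as dicts) whose absolute log2FC is at least min_abs_log2fc."""
    with gzip.open(path, "rt", newline="") as handle:
        reader = csv.DictReader(handle, delimiter="\t")
        for row in reader:
            if abs(float(row["log2FC"])) >= min_abs_log2fc:
                yield row


def write_demo_file(path):
    """Write a tiny made-up table in the documented layout (illustration only)."""
    header = FIXED_COLUMNS + ["sample1", "sample2"]
    rows = [
        ["3", "ACGTACGTAC", "ACGTACGT", "0.001", "1.0", "8.0", "3.0", "1", "8"],
        ["1", "TTGCATGC", "TTGCATGC", "0.04", "4.0", "2.0", "-1.0", "4", "2"],
    ]
    with gzip.open(path, "wt", newline="") as handle:
        writer = csv.writer(handle, delimiter="\t")
        writer.writerow(header)
        writer.writerows(rows)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        demo = os.path.join(tmp, "merged-diff-counts.tsv.gz")
        write_demo_file(demo)
        for row in read_contigs(demo, min_abs_log2fc=2.0):
            print(row["contig"], row["log2FC"])
```

Reading the file as a stream keeps memory use flat, which matters because k-mer tables can be large.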
- If new samples are added to config.json, make sure to remove the `metadata` folder in order to force Snakemake to re-make all targets that depend on it.
- Create a dekupl binary with two commands:
  - `dekupl build_index {genome}`: This command will download reference files and create all indexes
  - `dekupl run {dekupl_index} {config.yml} {output_dir}`: This command will run the dekupl pipeline