diff --git a/CONTRIBUTING.md b/CONTRIBUTING.md new file mode 100644 index 0000000..822a879 --- /dev/null +++ b/CONTRIBUTING.md @@ -0,0 +1,89 @@ +# Contributing guidelines + +We thank you in advance :thumbsup: :tada: for taking the time to contribute, whether with *code* or with *ideas*, to the project. + + +## Did you find a bug? + +* Ensure that the bug was not already reported by [searching under Issues]. + +* If you're unable to find an (open) issue addressing the problem, [open a new one]. Be sure to prefix the issue title with **[BUG]** and to include: + + - a *clear* description, + - as much relevant information as possible, and + - a *code sample* or an (executable) *test case* demonstrating the expected behaviour that is not occurring. + +## How to work on a new feature/bug + +Create an issue on GitHub, or alternatively pick one that has already been created. + +Assign yourself to that issue. + +Discussions on how to proceed with that issue take place in the comment section on that issue. + +Some of the work might already have been done by somebody, so discussing the issue first avoids unnecessary duplication of work and wasted time and effort. Another reason for discussing the issue beforehand is to communicate the planned changes to the team, as some features might impact different components, and we can plan accordingly. + +## How we work with Git + +All work should take place in a dedicated branch with a short descriptive name. + +Use comments in your code, and choose variable and function names that clearly show what you intend to implement. + +Once the feature is done you can request it to be merged back into `main` by making a Pull Request. + +Before making the pull request it is a good idea to rebase your branch onto `main` to ensure that any conflicts with the `main` branch are resolved before the PR is reviewed, so that we get a clean merge. + + +### General stuff about git and commit messages + +In general it is better to commit often.
Small commits are easier to roll back and also make the code easier to review. + +Write helpful commit messages that describe the changes and, where relevant, why they were necessary. + +Each commit should contain changes that are functionally connected and/or related. If, for example, you want to write _and_ in the first line of the commit message, this is an indicator that it should have been two commits. + +Learn how to select chunks of changed files to do multiple separate commits of unrelated things. This can be done with either `git add -p` or `git commit -p`. + + +#### Helpful commit messages + +The commit messages may be seen as meta-comments on the code that are incredibly helpful for anyone who wants to know how this piece of software works, including colleagues (current and future) and external users. + +Some tips about writing helpful commit messages: + + 1. Separate the subject (the first line of the message) from the body with a blank line. + 2. Limit the subject line to 50 characters. + 3. Capitalize the subject line. + 4. Do not end the subject line with a period. + 5. Use the imperative mood in the subject line. + 6. Wrap the body at 72 characters. + 7. Use the body to explain what and why vs. how. + +For an in-depth explanation of the above points, please see [How to Write a Git Commit Message](http://chris.beams.io/posts/git-commit/). + + +### How we do code reviews + +A code review is initiated when someone has made a Pull Request in the appropriate repo on GitHub. + +Work should not continue on the branch _unless_ it is a [Draft Pull Request](https://github.blog/2019-02-14-introducing-draft-pull-requests/). Once the PR is marked ready the review can start. + +The initiator of the PR should recruit a reviewer who is assigned reviewer duty on the branch. + +Other people may also look at and review the code.
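The review flow above starts from the PR branch. As an illustration (the repository, branch, and commit names here are hypothetical), a reviewer can list exactly the commits a branch would add on top of `main` with a `main..<branch>` revision range:

```shell
# Hypothetical throw-away repo to illustrate inspecting a PR branch.
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
git checkout -qb main
git -c user.name=demo -c user.email=demo@example.org commit -q --allow-empty -m "Initial commit"
git checkout -qb my-feature-branch
git -c user.name=demo -c user.email=demo@example.org commit -q --allow-empty -m "Add feature X"
# Commits under review: everything on the branch that is not yet in main
git log --oneline main..my-feature-branch
```

The same `main..my-feature-branch` range also works with `git diff` to see the combined change a PR introduces.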
+ +A reviewer's job is to: + + * Write polite and friendly comments - remember that it can be tough to have other people criticizing your work, and a little kindness goes a long way. Of course, this does not mean we should not comment on things that need to be changed. + * Read the code and make sure it is understandable. + * Make sure that commit messages and commits are structured so that it is possible to understand why certain changes were made. + +Once the review is positive the Pull Request can be _merged_ into `main` and the feature branch deleted. + + +---- + +Thanks again. + +[searching under Issues]: https://github.com/Juke34/RAIN/issues?utf8=%E2%9C%93&q=is%3Aissue%20label%3Abug%20%5BBUG%5D%20in%3Atitle +[open a new one]: https://github.com/Juke34/RAIN/issues/new?title=%5BBUG%5D \ No newline at end of file diff --git a/README.md b/README.md new file mode 100644 index 0000000..58c6d26 --- /dev/null +++ b/README.md @@ -0,0 +1,205 @@ +![GitHub CI](https://github.com/Juke34/RAIN/actions/workflows/main.yml/badge.svg) + +# RAIN - RNA Alterations Investigation using Nextflow + +RAIN is a Nextflow workflow designed for epitranscriptomic analyses, enabling the detection of RNA modifications in comparison to a reference genome. +Its primary goal is to distinguish true RNA editing events from genomic variants such as SNPs, with a particular emphasis on identifying A-to-I (Adenosine-to-Inosine) editing. + + + + + +## Table of Contents + + * [Foreword](#foreword) + * [Flowchart](#flowchart) + * [Installation](#installation) + * [Nextflow](#nextflow) + * [Container platform](#container-platform) + * [Docker](#docker) + * [Singularity](#singularity) + * [Usage and test](#usage) + * [Parameters](#parameters) + * [Output](#output) + * [Author](#author-and-contributors) + * [Contributing](#contributing) + + +## Foreword + +... + +## Flowchart + +...
+ +## Installation + +The prerequisites to run the pipeline are: + + * [Nextflow](https://www.nextflow.io/) >= 22.04.0 + * [Docker](https://www.docker.com) or [Singularity](https://sylabs.io/singularity/) + +### Nextflow + + * Via conda + +
+ See here + + ```bash + conda create -n nextflow + conda activate nextflow + conda install bioconda::nextflow + ``` +
+ + * Manually +
+ See here + Nextflow runs on most POSIX systems (Linux, macOS, etc) and can typically be installed by running these commands: + + ```bash + # Make sure Java 11 or later is installed on your computer by using the command: + java -version + + # Install Nextflow by entering this command in your terminal (it creates a file named nextflow in the current directory): + curl -s https://get.nextflow.io | bash + + # Add Nextflow binary to your user's PATH: + mv nextflow ~/bin/ + # OR system-wide installation: + # sudo mv nextflow /usr/local/bin + ```
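A quick way to sanity-check the `PATH` step is sketched below; it uses a placeholder script instead of the real `nextflow` binary, so the file name is the only thing it shares with an actual installation:

```shell
# Simulate `mv nextflow ~/bin/` with a placeholder and verify it is found on PATH.
mkdir -p "$HOME/bin"
printf '#!/bin/sh\necho nextflow-ok\n' > nextflow
chmod +x nextflow
mv nextflow "$HOME/bin/"
# Prepend ~/bin to PATH for this session if it is not already there
case ":$PATH:" in
  *":$HOME/bin:"*) : ;;
  *) export PATH="$HOME/bin:$PATH" ;;
esac
command -v nextflow   # should resolve to a path ending in /bin/nextflow
```

With the real binary in place, `nextflow -version` is the usual final check.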
+ +### Container platform + +To run the workflow you will need a container platform: docker or singularity. + +### Docker + +Please follow the instructions at the [Docker website](https://docs.docker.com/desktop/) + +### Singularity + +Please follow the instructions at the [Singularity website](https://docs.sylabs.io/guides/latest/admin-guide/installation.html) + +## Usage + +### Help + +You can first check the available options and parameters by running: + +```bash +nextflow run Juke34/RAIN -r v1.5.0 --help +``` + +### Profile + +To run the workflow you must select a profile according to the container platform you want to use: +- `singularity`, a profile using Singularity to run the containers +- `docker`, a profile using Docker to run the containers + +The command will look like that: + +```bash +nextflow run Juke34/RAIN -r vX.X.X -profile docker +``` + +Another profile is available (/!\\ actually not yet implemented): + +- `slurm`, to add if your system has a slurm executor (local by default) + +The use of the `slurm` profile will give a command like this one: + +```bash +nextflow run Juke34/RAIN -r vX.X.X -profile singularity,slurm +``` + +### Test + +With nextflow and docker available you can run (where vX.X.X is the release version you wish to use): + +```bash +nextflow run -profile docker,test Juke34/RAIN -r vX.X.X +``` + +Or via a clone of the repository: + +``` +git clone https://github.com/Juke34/rain.git +cd rain +nextflow run -profile docker,test rain.nf +``` + +## Parameters + +``` +RAIN - RNA Alterations Investigation using Nextflow - v0.1 + + Usage example: + nextflow run rain.nf -profile docker --genome /path/to/genome.fa --annotation /path/to/annotation.gff3 --reads /path/to/reads_folder --output /path/to/output --aligner hisat2 + + Parameters: + --help Prints the help section + + Input sequences: + --annotation Path to the annotation file (GFF or GTF) + --reads path to the reads file, folder or csv. 
If a folder is provided, all the files with a proper extension in the folder will be used. You can provide remote files (comma-separated list). + expected file extensions: <.fastq.gz>, <.fq.gz>, <.fastq>, <.fq> or <.bam>. + for paired reads extra <_R1_001> or <_R2_001> is expected where and <_001> are optional. e.g. , , ) + csv input expects 5 columns: sample, fastq_1, fastq_2, strandedness and read_type. + fastq_2 is optional and can be empty. Strandedness and read_type expect the same values as the corresponding RAIN parameters; if a value is provided via a RAIN parameter, it will override the value in the csv file. + Example of csv file: + sample,fastq_1,fastq_2,strandedness,read_type + control1,path/to/data1.fastq.bam,,auto,short_single + control2,path/to/data2_R1.fastq.gz,path/to/data2_R2.fastq.gz,auto,short_paired + --genome Path to the reference genome in FASTA format. + --read_type Type of reads among this list [short_paired, short_single, pacbio, ont] (no default) + + Output: + --output Path to the output directory (default: result) + + Optional input: + --aligner Aligner to use [default: hisat2] + --edit_site_tool Tool used for detecting edited sites. Default: reditools3 + --strandedness Set the strandedness for all your input reads (default: null). In auto mode, salmon will guess the library type for each fastq sample. [ 'U', 'IU', 'MU', 'OU', 'ISF', 'ISR', 'MSF', 'MSR', 'OSF', 'OSR', 'auto' ] + --edit_threshold Minimal number of edited reads to count a site as edited (default: 1) + --aggregation_mode Mode for aggregating editing counts mapped on genomic features. See documentation for details. Options are: "all" (default) or "cds_longest" + --clipoverlap Clip overlapping sequences in read pairs to avoid double counting.
(default: false) + + Nextflow options: + -profile Change the Nextflow profile (both the engine and the executor); more details in the GitHub README [debug, test, itrop, singularity, local, docker] +``` + +## Output + +Here is a description of the typical output you will get from RAIN: + +``` +└── rain_results # Output folder set using --outdir. Default: + │ + ├── AliNe # Folder containing AliNe alignment pipeline result (see https://github.com/Juke34/AliNe) + │ ├── alignment # bam alignment used by RAIN + │ ├── salmon_strandedness # strandedness collected by AliNe in case auto mode was used for fastq files + │ └── ... + │ + ├── bam_indicies # bam files and their indices (bam.bai) + │ + ├── FastQC # FastQC reports + │ + ├── gatk_markduplicates # metrics and bam after markduplicates + │ + └── Reditools2/Reditools3/Jacusa/sapin/ # Editing output from the corresponding tool + │ + └── feature_edits # Editing computed at different levels (genomic features, chromosome, global) +``` + +## Author and contributors + +Eduardo Ascarrunz (@eascarrunz) +Jacques Dainat (@Juke34) + +## Contributing + +Contributions from the community are welcome! See the [Contributing guidelines](https://github.com/Juke34/rain/blob/main/CONTRIBUTING.md) \ No newline at end of file diff --git a/build_images.sh b/build_images.sh index da28da7..532cf54 100755 --- a/build_images.sh +++ b/build_images.sh @@ -28,11 +28,9 @@ do echo ██████████████████▓▒░ Building ${imgname} ░▒▓██████████████████ # Reditools2 does not compile on arm64, force using amd64 compilation - if [[ $dir =~ "reditools2" ]];then - if [[ "$arch" == arm* || "$arch" == "aarch64" ]]; then - echo "Reditools2 does not compile on arm64, force using amd64 compilation" - docker_arch_option=" --platform linux/amd64" - fi + if [[ "$arch" == arm* || "$arch" == "aarch64" ]]; then + echo "Reditools2 does not compile on arm64, force using amd64 compilation" + docker_arch_option=" --platform linux/amd64" + fi fi docker build ${docker_arch_option} -t ${imgname} .
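The `build_images.sh` change above drops the Reditools2-only guard, so the amd64 fallback now applies to every image built on an ARM host. The detection logic boils down to the following portable-shell sketch (`myimage` is a placeholder, and the build command is echoed instead of executed):

```shell
# Mirror of the architecture check: force an amd64 build on ARM hosts.
arch=$(uname -m)
docker_arch_option=""
case "$arch" in
  arm*|aarch64) docker_arch_option=" --platform linux/amd64" ;;
esac
# On an ARM host this prints: docker build --platform linux/amd64 -t myimage .
echo "docker build${docker_arch_option} -t myimage ."
```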
diff --git a/data/chr21/chr21_small.bam b/data/chr21/chr21_small.bam new file mode 100644 index 0000000..b805d64 Binary files /dev/null and b/data/chr21/chr21_small.bam differ diff --git a/data/chr21/chr21_small.csv b/data/chr21/chr21_small.csv new file mode 100644 index 0000000..49248fe --- /dev/null +++ b/data/chr21/chr21_small.csv @@ -0,0 +1,3 @@ +sample,input_1,input_2,strandedness,read_type +test1,/Users/jacquesdainat/git/Juke34/rain/data/chr21/chr21_small_R1.fastq.gz,,auto,short_single +test2,/Users/jacquesdainat/git/Juke34/rain/data/chr21/chr21_small_R2.fastq.gz,,ISR,short_single \ No newline at end of file diff --git a/modules/aline.nf b/modules/aline.nf index 73e9f8d..7b81aa2 100644 --- a/modules/aline.nf +++ b/modules/aline.nf @@ -34,6 +34,7 @@ process AliNe { read_type, aligner, library_type, + "--data_type rna", "--outdir $task.workDir/AliNe", ].join(" ") // Copy command to shell script in work dir for reference/debugging. diff --git a/modules/reditools3.nf b/modules/reditools3.nf index acd4bd2..f833047 100644 --- a/modules/reditools3.nf +++ b/modules/reditools3.nf @@ -14,19 +14,19 @@ process reditools3 { script: // Set the strand orientation parameter from the library type parameter // Terms explained in https://salmon.readthedocs.io/en/latest/library_type.html - if (meta.libtype in ["ISR", "SR"]) { + if (meta.strandedness in ["ISR", "SR"]) { // First-strand oriented strand_orientation = "2" - } else if (meta.libtype in ["ISF", "SF"]) { + } else if (meta.strandedness in ["ISF", "SF"]) { // Second-strand oriented strand_orientation = "1" - } else if (meta.libtype in ["IU", "U"]) { + } else if (meta.strandedness in ["IU", "U"]) { // Unstranded strand_orientation = "0" } else { - // Unsupported: Pass the library type string so that it's reported in - // the reditools error message + // Unsupported library type: fall back to unstranded ("0") so that + // reditools can still run - strand_orientation = meta.libtype + strand_orientation = "0" } base_name = bam.BaseName diff --git a/nextflow.config b/nextflow.config index 16370da..31dcdd9 100644 --- a/nextflow.config +++ b/nextflow.config @@ -55,10 +55,10 @@
profiles { test { params.aline_profiles = "${baseDir}/config/resources/base_aline.config" params.aligner = "STAR" - params.reads = "${baseDir}/data/chr21/chr21_small_R1.fastq.gz " + params.reads = "${baseDir}/data/chr21/chr21_small_R1.fastq.gz" params.genome = "${baseDir}/data/chr21/chr21_small.fasta.gz" params.annotation = "${baseDir}/data/chr21/chr21_small_filtered.gff3.gz" - params.library_type = "ISR" + params.strandedness = "ISR" params.read_type = "short_single" } test2 { @@ -67,7 +67,7 @@ profiles { params.reads = "${baseDir}/data/chr21/" params.genome = "${baseDir}/data/chr21/chr21_small.fasta.gz" params.annotation = "${baseDir}/data/chr21/chr21_small_filtered.gff3.gz" - params.library_type = "ISR" + params.strandedness = "ISR" params.read_type = "short_paired" } } diff --git a/rain.nf b/rain.nf index 43d7de4..9366569 100644 --- a/rain.nf +++ b/rain.nf @@ -11,20 +11,19 @@ import java.nio.file.* // Input/output params params.reads = null // "/path/to/reads_{1,2}.fastq.gz/or/folder" -params.bam = null // "/path/to/reads.bam/or/folder" -params.csv = null // "/path/to/reads.bam/or/folder" params.genome = null // "/path/to/genome.fa" params.annotation = null // "/path/to/annotations.gff3" -params.outdir = "result" -params.clipoverlap = false +params.outdir = "rain_result" +params.clipoverlap = false /* Specific AliNe params (some are shared with RAIN)*/ // Read feature params -read_type_allowed = [ 'short_paired', 'short_single', 'pacbio', 'ont' ] -params.read_type = "short_paired" // short_paired, short_single, pacbio, ont -params.library_type = "auto" // can be 'U', 'IU', 'MU', 'OU', 'ISF', 'ISR', 'MSF', 'MSR', 'OSF', 'OSR', 'auto' - see https://github.com/Juke34/AliNe for more information -params.bam_library_type = null // can be 'U', 'IU', 'MU', 'OU', 'ISF', 'ISR', 'MSF', 'MSR', 'OSF', 'OSR', 'auto' - see https://github.com/Juke34/AliNe for more information +read_type_allowed = [ 'short_paired', 'short_single', 'pacbio', 'ont' ] +params.read_type = null // 
short_paired, short_single, pacbio, ont +strandedness_allowed = [ 'U', 'IU', 'MU', 'OU', 'ISF', 'ISR', 'MSF', 'MSR', 'OSF', 'OSR', 'auto' ] // see https://github.com/Juke34/AliNe for more information +params.strandedness = null +params.read_length = null // Used by STAR to set the sjdbOverhang parameter // Edit counting params edit_site_tools = ["reditools2", "reditools3", "jacusa2", "sapin"] @@ -75,10 +74,17 @@ def helpMSG() { Input sequences: --annotation Path to the annotation file (GFF or GTF) - --bam Path to the bam file or folder. - --csv Path to the csv file containing fastq/bam files and their metadata. The csv file must contain a 'sample' column with the sample name, a 'input_1' column with the path to the bam file/or fastq1 file, a 'input_2' column with the path to R2 fastq file in case of paired data (either empty), and a 'strandedness' column with the library type (e.g. U, IU, MU, OU, ISF, ISR, MSF, MSR, OSF, OSR). - --reads Path to the reads file or folder. If a folder is provided, all the files with proper extension (.fq, .fastq, .fq.gz, .fastq.gz) in the folder will be used. You can provide remote files (commma separated list). + --reads Path to the reads file, folder or csv. If a folder is provided, all the files with a proper extension in the folder will be used. You can provide remote files (comma-separated list). + expected file extensions: <.fastq.gz>, <.fq.gz>, <.fastq>, <.fq> or <.bam>. + for paired reads extra <_R1_001> or <_R2_001> is expected where and <_001> are optional. e.g. , , ) + csv input expects 5 columns: sample, fastq_1, fastq_2, strandedness and read_type. + fastq_2 is optional and can be empty. Strandedness and read_type expect the same values as the corresponding AliNe parameters; if a value is provided via an AliNe parameter, it will override the value in the csv file.
+ Example of csv file: + sample,fastq_1,fastq_2,strandedness,read_type + control1,path/to/data1.fastq.bam,,auto,short_single + control2,path/to/data2_R1.fastq.gz,path/to/data2_R2.fastq.gz,auto,short_paired --genome Path to the reference genome in FASTA format. + --read_type Type of reads among this list ${read_type_allowed} (no default) Output: --output Path to the output directory (default: $params.outdir) @@ -86,9 +92,7 @@ def helpMSG() { Optional input: --aligner Aligner to use [default: $params.aligner] --edit_site_tool Tool used for detecting edited sites. Default: $params.edit_site_tool - --read_type Type of reads among this list ${read_type_allowed} (default: short_paired) - --library_type Set the library_type of your fastq reads (default: auto). In auto mode salmon will guess the library type for each sample. [ 'U', 'IU', 'MU', 'OU', 'ISF', 'ISR', 'MSF', 'MSR', 'OSF', 'OSR', 'auto' ] - --bam_library_type Set the library_type of your BAM reads (default: null). When using BAM you must set a library type: [ 'U', 'IU', 'MU', 'OU', 'ISF', 'ISR', 'MSF', 'MSR', 'OSF', 'OSR', 'auto' ] + --strandedness Set the strandedness for all your input reads (default: null). In auto mode salmon will guess the library type for each fastq sample. [ 'U', 'IU', 'MU', 'OU', 'ISF', 'ISR', 'MSF', 'MSR', 'OSF', 'OSR', 'auto' ] --edit_threshold Minimal number of edited reads to count a site as edited (default: 1) --aggregation_mode Mode for aggregating edition counts mapped on genomic features. See documentation for details. Options are: "all" (default) or "cds_longest" --clipoverlap Clip overlapping sequences in read pairs to avoid double counting. 
(default: false) @@ -104,19 +108,15 @@ log.info """ General Parameters annotation : ${params.annotation} - bam : ${params.bam} - csv : ${params.csv} reads : ${params.reads} genome : ${params.genome} - - reads_library_type : ${params.library_type} - bam_library_type : ${params.bam_library_type} + strandedness : ${params.strandedness} outdir : ${params.outdir} Alignment Parameters aline_profiles : ${params.aline_profiles} aligner : ${params.aligner} - reads_library_type : ${params.library_type} + Edited Site Detection Parameters edit_site_tool : ${params.edit_site_tool} @@ -202,6 +202,12 @@ workflow { // ---------------------------------------------------------------------------- // --- DEAL WITH REFERENCE --- // check if reference exists + if (!params.reads) { + exit 1, "No input reads provided with --reads (fastq or bam files as file, file list, folder or csv)." + } + if (!params.genome) { + exit 1, "You must provide a path to the genome with --genome" + } Channel.fromPath(params.genome, checkIfExists: true) .ifEmpty { exit 1, "Cannot find genome matching ${params.genome}!\n" } .set{genome_raw} @@ -216,162 +222,249 @@ workflow { .ifEmpty { exit 1, "Cannot find annotation matching ${params.annotation}!\n" } .set{annotation} } -// ---------------------------------------------------------------------------- - def path_csv = params.csv - def path_bam = params.bam - def path_reads = params.reads - def has_fastq = false - // ---------------------- DEAL WITH BAM FILES ---------------------- - Channel.empty().set{sorted_bam} - if ( path_bam || path_csv ) { - - def bam_list=[] - def via_url = false - def via_csv = false - - if ( path_csv && path_csv.endsWith('.csv') ){ - log.info "Using CSV input file: ${path_csv}" - // --------- BAM CSV CASE --------- - via_csv = true - File input_csv = new File(path_csv) - if(!input_csv.exists()){ - error "The input ${params.csv} file does not exist!\n" - } - bams = Channel.fromPath(params.csv) - .splitCsv(header: true, sep: ',') - .map { 
row -> - if(row.sample == null || row.sample.trim() == ''){ - error "The input ${params.csv} file does not contain a 'sample' column!\n" - } - def sample_id = row.sample - if(row.input_1 == null || row.input_1.trim() == ''){ - error "The input ${params.csv} file does not contain a 'input_1' column!\n" - } - if( row.input_1.toString().endsWith('.bam') ){ - def input_1 = file(row.input_1.trim()) +// ---------------------------------------------------------------------------- +// DEAL WITH CSV FILE FIRST +// ---------------------------------------------------------------------------- + def path_reads = params.reads + def via_url = false + def via_csv = false + Channel.empty().set{csv_ch} + + if ( path_reads.endsWith('.csv') ){ + log.info "Using CSV input file: ${path_reads}" + // --------- BAM CSV CASE --------- + via_csv = true + File input_csv = new File(path_reads) + if(!input_csv.exists()){ + error "The input ${path_reads} file does not exist!\n" + } + + csv_ch = Channel.fromPath(params.reads) + .splitCsv(header: true, sep: ',') + .map { row -> + if(row.sample == null || row.sample.trim() == ''){ + error "The input ${params.reads} file does not contain a 'sample' column!\n" + } + def sample_id = row.sample + if(row.input_1 == null || row.input_1.trim() == ''){ + error "The input ${params.reads} file does not contain a 'input_1' column!\n" + } + def input_1 = file(row.input_1.trim()) + if( input_1.toString().endsWith('.bam') || Rain_utilities.is_fastq(input_1) ) { - if (! Rain_utilities.is_url(input_1) ) { - if (! input_1.exists() ) { - error "The input ${input_1} file does not does not exits!\n" - } - } else { - log.info "This bam input is an URL: ${input_1}" + def input_type = input_1.toString().endsWith('bam') ? 'bam' : 'fastq' + def input_url = null + if (! Rain_utilities.is_url(input_1) ) { + if (! 
input_1.exists() ) { + error "The input ${input_1} file does not exist!\n" + } + } else { + log.info "This bam input is an URL: ${input_1}" + input_url = true + } + // Check if strandedness is provided; params.strandedness takes priority. If it is provided neither as a parameter nor in the csv, raise an error. + def libtype + if ( params.strandedness ) { + libtype = params.strandedness + } else { + if(row.strandedness == null || row.strandedness.trim() == ''){ - error "The input ${params.csv} file does not contain a 'strandedness' column!\n" + error "The input ${params.reads} file does not contain a 'strandedness' column!\n" } - def libtype = row.strandedness ?: 'U' - def meta = [ id: sample_id, libtype: libtype ] - return tuple(meta, input_1) - } - else if ( Rain_utilities.is_fastq(row.input_1.trim()) ) { - has_fastq = true - /* def input_1 = file(row.input_1.trim()) - if (! Rain_utilities.is_url(input_1) ) { - if (! input_1.exists() ) { - error "The input ${input_1} file does not does not exits!\n" - } - } else { - log.info "This fastq input is an URL: ${input_1}" - } - */ - } - else { - error "The input ${row.input_1} file is not a BAM or FASTQ file!\n" } + libtype = row.strandedness.trim() + def meta = [ id: sample_id, strandedness: libtype, input_type: input_type, is_url: input_url ] + return tuple(meta, input_1) + } + else { + error "The input ${row.input_1} file is not a BAM or FASTQ file!\n" } - } - else { - - // --------- BAM LIST CASE --------- - if( path_bam.indexOf(',') >= 0) { - // Cut into list with coma separator - str_list = path_bam.tokenize(',') - // loop over elements - str_list.each { - str_list2 = it.tokenize(' ') - str_list2.each { - if ( it.endsWith('.bam') ){ - if ( Rain_utilities.is_url(it) ) { - log.info "This bam input is an URL: ${it}" - via_url = true } - bam_list.add(file(it)) // use file insted of File for URL + // If we are here it is because the csv is correct.
+ // let's create another channel with meta information + Channel.fromPath(params.reads) + .splitCsv(header: true) + .map { row -> + def File file = new File(row.input_1) + fileClean = file.baseName.replaceAll(/\.(gz)$/, '') // remove .gz + fileClean = fileClean.replaceAll(/\.(fastq|fq)$/, '') // remove .fastq or .fq + + return tuple (fileClean, row.sample, row.strandedness) + } + .set { csv_meta_ch } + } + + // Separate FASTQ samples from BAM samples + csv_in_bam = csv_ch.filter { meta, reads -> meta.input_type == 'bam' } + csv_in_fastq = csv_ch.filter { meta, reads -> meta.input_type == 'fastq' } + +// ---------------------------------------------------------------------------- +// DEAL WITH BAM FILES +// ---------------------------------------------------------------------------- + + def bam_path_reads = params.reads + def bam_list=[] + Channel.empty().set{bams} + + if (via_csv) { + bams = csv_in_bam + } + else { + + // --------- BAM LIST CASE --------- + if( bam_path_reads.indexOf(',') >= 0) { + // Cut into list with comma separator + str_list = bam_path_reads.tokenize(',') + // loop over elements + str_list.each { + str_list2 = it.tokenize(' ') + str_list2.each { + if ( it.endsWith('.bam') ){ + if ( Rain_utilities.is_url(it) ) { + log.info "This bam input is an URL: ${it}" + via_url = true } + bam_list.add(file(it)) // use file instead of File for URL } } } - else { - // --------- BAM FOLDER CASE --------- - File input_reads = new File(path_bam) - if(input_reads.exists()){ - if ( input_reads.isDirectory()) { - log.info "The input ${path_bam} is a folder!" - // in case of folder provided, add a trailing slash if missing - path_bam = "${input_reads}" + "/*.bam" } + else { + // --------- BAM FOLDER CASE --------- + def File input_reads = new File(bam_path_reads) + if(input_reads.exists()){ + if ( input_reads.isDirectory()) { + log.info "The input ${bam_path_reads} is a folder! Check for bam presence..."
+ // in case a folder is provided, look for bam files inside it + def matchingFiles = input_reads.listFiles().findAll { + it.name ==~ /.*\.bam$/ } - // --------- BAM FILE CASE --------- - else { - if ( path_bam.endsWith('.bam') ){ - log.info "The input ${path_bam} is a bam file!" - if ( Rain_utilities.is_url(path_bam) ) { - log.info "This bam input is an URL: ${path_bam}" - via_url = true - } + if (matchingFiles) { + bam_path_reads = "${input_reads}" + "/*.bam" + log.info "Bam file found" + } else { + bam_path_reads = null + log.info "No bam file found" + } + } + // --------- BAM FILE CASE --------- + else { + if ( bam_path_reads.endsWith('.bam') ){ + log.info "The input ${bam_path_reads} is a bam file!" + if ( Rain_utilities.is_url(bam_path_reads) ) { + log.info "This bam input is an URL: ${bam_path_reads}" + via_url = true } } } } } + } - // Add lib - if (! via_csv) { - if (via_url ){ - my_samples = Channel.of(bam_list) - bams = my_samples.flatten().map { it -> [it.name.split('_')[0], it] } - .groupTuple() - .ifEmpty { exit 1, "Cannot find reads matching ${path_reads}!\n" } - } else { - bams = Channel.fromFilePairs(path_bam, size: 1, checkIfExists: true) - .ifEmpty { exit 1, "Cannot find reads matching ${path_bam}!\n" } + + // Load Bam + if (! via_csv) { + if ( params.strandedness && params.strandedness.toUpperCase() == "AUTO" ) { + log.info "⚠️ Strandedness 'auto' mode is not supported for BAM files; strandedness will be set to null. Use the --strandedness parameter to set it explicitly. See help for more information."
+ } + if (via_url ){ + my_samples = Channel.of(bam_list) + bams = my_samples.flatten().map { it -> [it.name.split('_')[0], it] } + .groupTuple() + .ifEmpty { exit 1, "Cannot find reads matching ${bam_path_reads}!\n" } + } else { + if (bam_path_reads.endsWith('.bam')) { + bams = Channel.fromFilePairs(bam_path_reads, size: 1, checkIfExists: false) } - - // check if bam library type is provided when bam files are used - bams.collect().map { list -> - if (!list.isEmpty() && !params.bam_library_type ) { - exit 1, "⚠️ When using BAM files, the library type must be provided. Please use the parameter --bam_library_type .\n" - } - } - // Set structure with dictionary as first value - bams = bams.map { id, bam_file -> - def meta = [ id: id, libtype: params.bam_library_type ] - tuple(meta, bam_file) - } } - // sort the bam files - sorted_bam = samtools_sort_bam( bams ) + + // Set structure with dictionary as first value + bams = bams.map { id, bam_file -> + def strand = params.strandedness + strand = (strand && strand.toUpperCase() != "AUTO") ? 
strand : null // if strandedness is set to auto, set it to null + + def meta = [ id: id, strandedness: strand ] + tuple(meta, bam_file) + } } + + // sort the bam files + Channel.empty().set{sorted_bam} + sorted_bam = samtools_sort_bam( bams ) + +// ---------------------------------------------------------------------------- +// DEAL WITH FASTQ FILES // ---------------------------------------------------------------------------- - // DEAL WITH FASTQ FILES // Perform AliNe alignment Channel.empty().set{aline_alignments_all} - if (path_reads || ( path_csv && has_fastq) ) { - - if ( path_csv && path_csv.endsWith('.csv') ){ - path_reads = path_csv + def fastq_list=[] + def aline_data_in = null + + if ( via_csv ){ + aline_data_in = path_reads + } + else { + // --------- FASTQ LIST CASE --------- + if( path_reads.indexOf(',') >= 0) { + // Cut into list with comma separator + str_list = path_reads.tokenize(',') + // loop over elements + str_list.each { + str_list2 = it.tokenize(' ') + str_list2.each { + if ( Rain_utilities.is_fastq( it ) ){ + if ( Rain_utilities.is_url(it) ) { + log.info "This fastq input is an URL: ${it}" + via_url = true + } + fastq_list.add(file(it)) // use file instead of File for URL + } + } + } + aline_data_in = fastq_list.join(',') // Join the list into a string for AliNe input (filtered to contain only fastq files) } - + else { + // --------- FASTQ FOLDER CASE --------- + File input_reads = new File(path_reads) + if(input_reads.exists()){ + if ( input_reads.isDirectory()) { + log.info "The input ${path_reads} is a folder! Check for fastq presence..."
+ // in case of folder provided, add a trailing slash if missing + def matchingFiles = input_reads.listFiles().findAll { + it.name ==~ /.*_(R?1|R?2)(_.*)?\.(fastq|fq)(\.gz)?$/ + } + if (matchingFiles) { + aline_data_in = "${input_reads}" + "/" // Aline will handle the folder + log.info "Fastq file found" + } else { + log.info "No fastq file found" + } + } + // --------- FASTQ FILE CASE --------- + else { + if ( Rain_utilities.is_fastq( path_reads ) ){ + log.info "The input ${path_reads} is a fastq file!" + if ( Rain_utilities.is_url(path_reads) ) { + log.info "This fastq input is an URL: ${path_reads}" + via_url = true + } + aline_data_in = path_reads // Use the file as is + } + } + } + } + } + Channel.empty().set{aline_alignments_all} + if (aline_data_in){ ALIGNMENT ( - 'Juke34/AliNe -r v1.4.0', // Select pipeline + 'Juke34/AliNe -r v1.5.1', // Select pipeline "${workflow.resume?'-resume':''} -profile ${aline_profile}", // workflow opts supplied as params for flexibility "-config ${params.aline_profiles}", - "--reads ${path_reads}", + "--reads ${aline_data_in}", genome, "--read_type ${params.read_type}", "--aligner ${params.aligner}", - "--library_type ${params.library_type}", + "--strandedness ${params.strandedness}", workflow.workDir.resolve('Juke34/AliNe').toUriString() ) @@ -387,50 +480,94 @@ workflow { } // Convert each BAM file into a tuple, with the base name as the first element .set { aline_alignments } // Store the channel - if (params.library_type.contains("auto") ) { - log.info "Library type is set to auto, extracting it from salmon output" - // GET TUPLE [ID, OUTPUT_SALMON_LIBTYPE] FILES - aline_alignments_all = ALIGNMENT.out.output + // SET CORRECT ID NAME WHEN CSV catched from CSV sample row + if (via_csv){ + csv_meta_ch + .join(aline_alignments) + .map{ name_AliNe, sample, strandedness, bam -> + return tuple (sample, bam) + } + .set { aline_alignments } + } + + // GET TUPLE [ID, OUTPUT_SALMON_LIBTYPE] FILES + if ( params.strandedness && 
+             params.strandedness.toUpperCase() != "AUTO" ) {
+            log.info "Strandedness type is set to ${params.strandedness}, no need to extract it from salmon output"
+            aline_alignments_all = aline_alignments.map { id, bam ->
+                def meta = [ id: id, strandedness: params.strandedness ]
+                tuple(meta, bam)}
+        } else {
+            log.info "Try to get strandedness from AliNe salmon output"
+            aline_libtype = ALIGNMENT.out.output
                 .map { dir ->
-                    files("$dir/salmon_libtype/*/*.json", checkIfExists: true) // Find BAM files inside the output directory
+                    files("$dir/salmon_strandedness/*/*.json", checkIfExists: true) // Find salmon JSON files inside the output directory
                 }
                 .flatten() // Ensure we emit each file separately
                 .map { json ->
                     def name = json.getParent().getName().split('_seqkit')[0] // Extract the base name of the BAM file. _seqkit is the separator. The name is in the folder containing the json file. Why take this one? Because it is the same as the bam name set by AliNe. It will be used to sync both values
                     tuple(name, json)
                 } // Convert each JSON file into a tuple, with the base name as the first element
-                .set { aline_libtype } // Store the channel
+
             // Extract the library type from the JSON file
             aline_libtype = extract_libtype(aline_libtype)
-            aline_alignments.join(aline_libtype)
-                .map { key, val1, val2 -> tuple(key, val1, val2) }
+
+            // because id and name can be different in the CSV case, we add the AliNe name to the tuple
+            aline_alignments_tmp = aline_alignments.map { id, file ->
+                def aline_name = file.getName().split('_seqkit')[0] // Extract the base name of the BAM file. _seqkit is the separator.
+                tuple(aline_name, id, file)
+            }
-        } else {
-            log.info "Library type is set to ${params.library_type}, no need to extract it from salmon output"
-            aline_alignments_all = aline_alignments.map { name, bam -> tuple(name, bam, params.library_type) }
-        }
+            aline_alignments_tmp.join(aline_libtype, remainder: true)
+                .map { aline_name, id, file, lib ->
+                    def meta = [ id: id, strandedness: lib ]
+                    tuple(meta, file)
+                } // Join the two channels on the key (the name of the BAM file)
+                .set { aline_alignments_all } // Store the channel
+
+            // Here strandedness is null if AliNe could not guess it. Try to fetch it from the CSV file in the CSV case
+            if (via_csv) {
+                sample_with_strand = aline_alignments_all.filter { meta, reads -> meta.strandedness && meta.strandedness != 'auto' }
+                sample_no_strand = aline_alignments_all.filter { meta, reads -> !meta.strandedness || meta.strandedness == 'auto' }
+
+                log.info "Try to get strandedness from CSV file"
+                // Get the strandedness from the csv file
+                csv_meta_ch.map { read_id, sample_id, strandedness ->
+                    [sample_id, strandedness]
+                }
+                .set { csv_meta_ch_short } // Set the channel with the new strandedness
+
+                channel_data_tojoin = sample_no_strand.map { meta, bam_path ->
+                    [meta.id, [meta, bam_path]]
+                }
+
+                channel_data_tojoin
+                    .join(csv_meta_ch_short)
+                    .map { id, data_pair, strandedness ->
+                        def (meta, bam_path) = data_pair
+                        meta.strandedness = strandedness
+                        [meta, bam_path]
+                    }
+                    .set{sample_no_strand}
-        // transform [ID, BAM, LIBTYPE] into [[id: 'ID', libtype: 'LIBTYPE'], file('BAM')]
-        aline_alignments_all = aline_alignments_all.map { id, file, lib ->
-            def meta = [ id: id, libtype: lib ]
-            tuple(meta, file)
+                aline_alignments_all = sample_with_strand.concat(sample_no_strand)
+            }
+        }
     }
-    // call rain
+    // MERGE ALINE BAM AND INPUT BAM TOGETHER
     tuple_sample_sortedbam = aline_alignments_all.mix(sorted_bam)
     log.info "The following bam file(s) will be processed by RAIN:"
+    tuple_sample_sortedbam.view()

     // STEP 1 QC with fastp ?
-    Channel.empty().set{logs}
-
+    Channel.empty().set{logs}
     // stat on aligned reads
     fastqc_ali(tuple_sample_sortedbam, "ali")
     logs.concat(fastqc_ali.out).set{logs} // save log
     // remove duplicates
     gatk_markduplicates(tuple_sample_sortedbam)
     logs.concat(gatk_markduplicates.out.log).set{logs} // save log
     // stat on bam without duplicates
     fastqc_dup(gatk_markduplicates.out.tuple_sample_dedupbam, "dup")
     logs.concat(fastqc_dup.out).set{logs} // save log
     // Clip overlap
@@ -471,7 +608,7 @@ workflow {
         default:
             exit(1, "Wrong edit site tool was passed")
     }
-
+
 }
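The FASTQ FOLDER case above selects read files with the Groovy full-match operator `==~` against the pattern `.*_(R?1|R?2)(_.*)?\.(fastq|fq)(\.gz)?$`. As a quick sanity check of what that pattern accepts, here is a sketch in Python (illustration only — the file names are made up; Python's `re.fullmatch` mirrors Groovy's `==~`, which requires the whole name to match):

```python
import re

# Same pattern as the Groovy `it.name ==~ /.../` check in the folder case.
pattern = re.compile(r'.*_(R?1|R?2)(_.*)?\.(fastq|fq)(\.gz)?$')

names = [
    "sampleA_R1.fastq.gz",   # accepted: _R1 token, .fastq.gz extension
    "sampleA_2.fq",          # accepted: bare _2 token, .fq extension
    "sampleB_R1_001.fastq",  # accepted: extra _001 segment allowed by (_.*)?
    "sampleC.fastq",         # rejected: no _1/_2/_R1/_R2 token
    "notes_R1.txt",          # rejected: wrong extension
]

for name in names:
    print(name, "->", bool(pattern.fullmatch(name)))
```

This illustrates why a folder of plain `sample.fastq` files (without an `_1`/`_R1`-style suffix) would be reported as "No fastq file found" by the branch above.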