Fix @RG tag in sam/bam file output #196
Description
Issue 1
Currently FONDA does not discriminate between lanes of a single sample. All lanes receive identical @RG ID: tags
Approach
Since alignment is done on a per-lane basis for DNA-based workflows (e.g. DNACapVar_Fastq), add the lane number to the read group. This would align better with standard practice (link)
Example
sample_manifest.txt
| parameterType | shortName | Parameter1 | Parameter2 |
|---|---|---|---|
| fastqFile | SampleA | SampleA_S1_L001_R1_001.fastq.gz | SampleA_S1_L001_R2_001.fastq.gz |
| fastqFile | SampleA | SampleA_S2_L002_R1_001.fastq.gz | SampleA_S2_L002_R2_001.fastq.gz |
The @RG ID: tag would be:
| parameterType |
|---|
| fastqFile |
| fastqFile |
I would rather iterate over the lanes and append the lane number to the sample name:
SampleA+L001
rather than pull it out of the longest common substring of the sample's reads. This keeps the lane numbering consecutive and makes it easier to enforce, because it does not depend on sample name prefixes.
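A minimal sketch of the numbering scheme described above, not FONDA code. The class and method names are hypothetical, and the separator is left as a parameter because `SampleA+L001` can be read either as a literal `+` or as plain concatenation. The lane count is assumed to be the number of fastqFile rows the manifest lists for the sample.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration only; not part of FONDA.
class LaneReadGroupIds {

    /**
     * Numbers the lanes of one sample consecutively (L001, L002, ...) and
     * appends each lane label to the sample name to form a read group ID.
     */
    static List<String> laneReadGroupIds(String sampleName, int laneCount, String separator) {
        List<String> ids = new ArrayList<>();
        for (int lane = 1; lane <= laneCount; lane++) {
            // %03d zero-pads the lane index to three digits, e.g. 1 -> 001.
            ids.add(String.format("%s%sL%03d", sampleName, separator, lane));
        }
        return ids;
    }

    public static void main(String[] args) {
        // SampleA has two fastqFile rows in the example manifest, hence two lanes.
        for (String id : laneReadGroupIds("SampleA", 2, "+")) {
            System.out.println(id);
        }
    }
}
```

Because the index comes from the row order rather than from the file name, the IDs stay unique per lane even when the FASTQ names share no common prefix.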
Please let me know if this is clear.
Issue 2
All workflows should set the LB tag to the sample name, not only amplicon sequencing; the other workflows currently get a hard-coded LB:DNA. The rationale is the same as above: to align with current best practice.
fonda/src/main/java/com/epam/fonda/tools/impl/BwaSort.java
Lines 108 to 110 in 4a651ca
    .equals(configuration.getGlobalConfig().getPipelineInfo().getWorkflow())
    ? String.format("\"@RG\\tID:%s\\tSM:%s\\tLB:%s\\tPL:Illumina\"", sampleName, sampleName, sampleName)
    : String.format("\"@RG\\tID:%s\\tSM:%s\\tLB:DNA\\tPL:Illumina\"", sampleName, sampleName);
fonda/src/main/java/com/epam/fonda/tools/impl/NovoalignSort.java
Lines 117 to 119 in 4a651ca
    return isDnaAmpliconWorkflow(configuration)
    ? String.format("\'@RG\\tID:%s\\tSM:%s\\tLB:%s\\tPL:Illumina\'", sampleName, sampleName, sampleName)
    : String.format("\'@RG\\tID:%s\\tSM:%s\\tLB:DNA\\tPL:Illumina\'", sampleName, sampleName);
Approach
Remove this check and use @RG\\tID:%s\\tSM:%s\\tLB:%s\\tPL:Illumina for all workflows.
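As a rough sketch of what the simplified code could look like once the workflow check is gone: the class and method names below are hypothetical, and the quote character is passed in only because the two snippets above wrap the header differently (double quotes in BwaSort, single quotes in NovoalignSort).

```java
// Hypothetical helper for illustration only; not the actual FONDA implementation.
class ReadGroupHeader {

    /**
     * Builds the @RG header with the same tag layout for every workflow:
     * ID, SM and LB all carry the sample name, and PL is fixed to Illumina.
     * The "\\t" sequences stay as a literal backslash followed by 't',
     * as in the existing format strings.
     */
    static String build(char quote, String sampleName) {
        return String.format("%c@RG\\tID:%s\\tSM:%s\\tLB:%s\\tPL:Illumina%c",
                quote, sampleName, sampleName, sampleName, quote);
    }

    public static void main(String[] args) {
        System.out.println(build('"', "SampleA"));   // BWA-style quoting
        System.out.println(build('\'', "SampleA"));  // Novoalign-style quoting
    }
}
```

With a single builder, the amplicon-specific branch disappears from both tools and the LB value can no longer drift between workflows.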