handling duplicated sequences in output.fusions.fasta #38

@racng

Hi, in the filtered fusions.fasta output from a run on a single sample, I found that the number of unique headers is larger than the number of unique sequences:

grep -v '>' output.fusions.fasta | sort | uniq | wc -l
11429
grep '>' output.fusions.fasta | sort | uniq | wc -l
13330

Here I looked up the headers of one randomly chosen repeated sequence:

grep -B 1 AAAACCCTTCCCTCCCCCGCTCCCCCGGAAGTGCTTTTCCAAGATTCGGGCCGGAGAGAGGCCTTGTAGGCACAGCGGCTGAGACTCGATCTGCTCCAAGTAGGGGCTCCAGCGCGGGTCGGAGTCTGGGGGTTCGCGCCCGCCGACCCGCGCCCTGCTCCCTCTCAGCACCTGGGCGGACGACGTGAACGATCAAGAGAGAGGGACCATAGTAGATCACGAGAAAAGAGTCGACGTCATAAATCCCGTAGTAGAGACCGTCATGACGATTATTACAGAGAGAGAAGCAGAGAACGAGAGAGGCACCGGGATCGTGACCGAGACCGTGACCGAGAGCGTGACCGAGAGCGCGAATATCGTCATCGTTAGAAGCTGAAGGAAGAGGATCACCTTCCAAGACAAAACAGTCTTCATGGGGGAAAAATGACGCTTGTCCAGCAGTTTGCTTCTTGTGATTGAACTGAACCTGTAAGGATTCATGGATAAAATGAACAGGAATAGATCTGAATAAAGCAAATCTGCATAAATGGTAACCAGTAGCTCTACTTTTATTTTTTATGTTGCTTAACTGTTTTATTTGAAGGAAACCTGTGTGATTTAAAAAGTTATAGCTTTTGCAACTTTATTACTGGTTATATACATTTGGCCATTATGATGTGCAAGCAATTGGAAAAAAAGTCAAGTAAATGCTTGTTTTTGTAGTAGTTTGTTCTTGTTAAAAATGTTTATATGATAATGTCTGTAAACAGCATCACTTTGATTACAATAGATGTAGTGTTGTAATAAACTGTTTAATGGGG$ output.fusions.fasta 
>ENST00000547219.5_0:182_ENST00000266679.8_1611:2229
AAAACCCTTCCCTCCCCCGCTCCCCCGGAAGTGCTTTTCCAAGATTCGGGCCGGAGAGAGGCCTTGTAGGCACAGCGGCTGAGACTCGATCTGCTCCAAGTAGGGGCTCCAGCGCGGGTCGGAGTCTGGGGGTTCGCGCCCGCCGACCCGCGCCCTGCTCCCTCTCAGCACCTGGGCGGACGACGTGAACGATCAAGAGAGAGGGACCATAGTAGATCACGAGAAAAGAGTCGACGTCATAAATCCCGTAGTAGAGACCGTCATGACGATTATTACAGAGAGAGAAGCAGAGAACGAGAGAGGCACCGGGATCGTGACCGAGACCGTGACCGAGAGCGTGACCGAGAGCGCGAATATCGTCATCGTTAGAAGCTGAAGGAAGAGGATCACCTTCCAAGACAAAACAGTCTTCATGGGGGAAAAATGACGCTTGTCCAGCAGTTTGCTTCTTGTGATTGAACTGAACCTGTAAGGATTCATGGATAAAATGAACAGGAATAGATCTGAATAAAGCAAATCTGCATAAATGGTAACCAGTAGCTCTACTTTTATTTTTTATGTTGCTTAACTGTTTTATTTGAAGGAAACCTGTGTGATTTAAAAAGTTATAGCTTTTGCAACTTTATTACTGGTTATATACATTTGGCCATTATGATGTGCAAGCAATTGGAAAAAAAGTCAAGTAAATGCTTGTTTTTGTAGTAGTTTGTTCTTGTTAAAAATGTTTATATGATAATGTCTGTAAACAGCATCACTTTGATTACAATAGATGTAGTGTTGTAATAAACTGTTTAATGGGG
--
>ENST00000547219.5_0:182_ENST00000456847.7_1297:1915
AAAACCCTTCCCTCCCCCGCTCCCCCGGAAGTGCTTTTCCAAGATTCGGGCCGGAGAGAGGCCTTGTAGGCACAGCGGCTGAGACTCGATCTGCTCCAAGTAGGGGCTCCAGCGCGGGTCGGAGTCTGGGGGTTCGCGCCCGCCGACCCGCGCCCTGCTCCCTCTCAGCACCTGGGCGGACGACGTGAACGATCAAGAGAGAGGGACCATAGTAGATCACGAGAAAAGAGTCGACGTCATAAATCCCGTAGTAGAGACCGTCATGACGATTATTACAGAGAGAGAAGCAGAGAACGAGAGAGGCACCGGGATCGTGACCGAGACCGTGACCGAGAGCGTGACCGAGAGCGCGAATATCGTCATCGTTAGAAGCTGAAGGAAGAGGATCACCTTCCAAGACAAAACAGTCTTCATGGGGGAAAAATGACGCTTGTCCAGCAGTTTGCTTCTTGTGATTGAACTGAACCTGTAAGGATTCATGGATAAAATGAACAGGAATAGATCTGAATAAAGCAAATCTGCATAAATGGTAACCAGTAGCTCTACTTTTATTTTTTATGTTGCTTAACTGTTTTATTTGAAGGAAACCTGTGTGATTTAAAAAGTTATAGCTTTTGCAACTTTATTACTGGTTATATACATTTGGCCATTATGATGTGCAAGCAATTGGAAAAAAAGTCAAGTAAATGCTTGTTTTTGTAGTAGTTTGTTCTTGTTAAAAATGTTTATATGATAATGTCTGTAAACAGCATCACTTTGATTACAATAGATGTAGTGTTGTAATAAACTGTTTAATGGGG

The sequence appears twice because ENST00000547219.5 can pair with either ENST00000266679.8 or ENST00000456847.7. However, ENST00000266679.8 and ENST00000456847.7 are transcripts of the same gene, so the two entries seem biologically redundant. Would it make sense to reduce the redundancy by converting the ENST IDs to ENSG IDs and then keeping only unique header+sequence pairs before proceeding to kallisto requant?
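
To make the proposal concrete, here is a rough Python sketch of what I have in mind. It is not part of the tool. The names `read_fasta`, `collapse_to_genes` and the `tx2gene` dictionary are my own placeholders, and I am assuming every header follows the `TRANSCRIPT_start:end_TRANSCRIPT_start:end` layout seen above. One caveat: the transcript-relative coordinates differ between the two headers (1611:2229 vs 1297:1915), so swapping ENST for ENSG alone would still leave distinct headers. The sketch therefore groups records by (gene A, gene B, sequence) and keeps the first original header as the representative.

```python
import re
from typing import Dict, Iterable, Iterator, List, Tuple

# Assumed header layout: <txA>_<start>:<end>_<txB>_<start>:<end>
HEADER_RE = re.compile(
    r"^(?P<tx_a>[^_]+)_\d+:\d+_(?P<tx_b>[^_]+)_\d+:\d+$"
)


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs; the header is returned without '>'."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def collapse_to_genes(
    records: Iterable[Tuple[str, str]],
    tx2gene: Dict[str, str],
) -> List[Tuple[str, str]]:
    """Keep one record per (gene A, gene B, sequence) combination.

    tx2gene is a hypothetical transcript-to-gene mapping supplied by the
    caller. Transcripts missing from the mapping fall back to their own ID,
    so they are never merged with anything else by accident.
    """
    kept: Dict[Tuple[str, str, str], str] = {}
    for header, seq in records:
        match = HEADER_RE.match(header)
        if match is None:
            raise ValueError(f"unexpected header layout: {header}")
        gene_a = tx2gene.get(match.group("tx_a"), match.group("tx_a"))
        gene_b = tx2gene.get(match.group("tx_b"), match.group("tx_b"))
        # The first header seen for a key becomes its representative.
        kept.setdefault((gene_a, gene_b, seq), header)
    return [(header, key[2]) for key, header in kept.items()]
```

The mapping could be built from the same annotation used for the run, and the returned records written to a new FASTA file before the kallisto step. Keeping the original transcript-level header as the representative keeps headers unique in the output and avoids having to invent a new naming scheme.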
