Hi, in the filtered fusions.fasta output from a run on a single sample, I found that there are more unique headers than unique sequences:
grep -v '>' output.fusions.fasta | sort | uniq | wc -l
11429
grep '>' output.fusions.fasta | sort | uniq | wc -l
13330
To see why, I looked up the headers of one repeated sequence, picked at random:
grep -B 1 AAAACCCTTCCCTCCCCCGCTCCCCCGGAAGTGCTTTTCCAAGATTCGGGCCGGAGAGAGGCCTTGTAGGCACAGCGGCTGAGACTCGATCTGCTCCAAGTAGGGGCTCCAGCGCGGGTCGGAGTCTGGGGGTTCGCGCCCGCCGACCCGCGCCCTGCTCCCTCTCAGCACCTGGGCGGACGACGTGAACGATCAAGAGAGAGGGACCATAGTAGATCACGAGAAAAGAGTCGACGTCATAAATCCCGTAGTAGAGACCGTCATGACGATTATTACAGAGAGAGAAGCAGAGAACGAGAGAGGCACCGGGATCGTGACCGAGACCGTGACCGAGAGCGTGACCGAGAGCGCGAATATCGTCATCGTTAGAAGCTGAAGGAAGAGGATCACCTTCCAAGACAAAACAGTCTTCATGGGGGAAAAATGACGCTTGTCCAGCAGTTTGCTTCTTGTGATTGAACTGAACCTGTAAGGATTCATGGATAAAATGAACAGGAATAGATCTGAATAAAGCAAATCTGCATAAATGGTAACCAGTAGCTCTACTTTTATTTTTTATGTTGCTTAACTGTTTTATTTGAAGGAAACCTGTGTGATTTAAAAAGTTATAGCTTTTGCAACTTTATTACTGGTTATATACATTTGGCCATTATGATGTGCAAGCAATTGGAAAAAAAGTCAAGTAAATGCTTGTTTTTGTAGTAGTTTGTTCTTGTTAAAAATGTTTATATGATAATGTCTGTAAACAGCATCACTTTGATTACAATAGATGTAGTGTTGTAATAAACTGTTTAATGGGG$ output.fusions.fasta
>ENST00000547219.5_0:182_ENST00000266679.8_1611:2229
AAAACCCTTCCCTCCCCCGCTCCCCCGGAAGTGCTTTTCCAAGATTCGGGCCGGAGAGAGGCCTTGTAGGCACAGCGGCTGAGACTCGATCTGCTCCAAGTAGGGGCTCCAGCGCGGGTCGGAGTCTGGGGGTTCGCGCCCGCCGACCCGCGCCCTGCTCCCTCTCAGCACCTGGGCGGACGACGTGAACGATCAAGAGAGAGGGACCATAGTAGATCACGAGAAAAGAGTCGACGTCATAAATCCCGTAGTAGAGACCGTCATGACGATTATTACAGAGAGAGAAGCAGAGAACGAGAGAGGCACCGGGATCGTGACCGAGACCGTGACCGAGAGCGTGACCGAGAGCGCGAATATCGTCATCGTTAGAAGCTGAAGGAAGAGGATCACCTTCCAAGACAAAACAGTCTTCATGGGGGAAAAATGACGCTTGTCCAGCAGTTTGCTTCTTGTGATTGAACTGAACCTGTAAGGATTCATGGATAAAATGAACAGGAATAGATCTGAATAAAGCAAATCTGCATAAATGGTAACCAGTAGCTCTACTTTTATTTTTTATGTTGCTTAACTGTTTTATTTGAAGGAAACCTGTGTGATTTAAAAAGTTATAGCTTTTGCAACTTTATTACTGGTTATATACATTTGGCCATTATGATGTGCAAGCAATTGGAAAAAAAGTCAAGTAAATGCTTGTTTTTGTAGTAGTTTGTTCTTGTTAAAAATGTTTATATGATAATGTCTGTAAACAGCATCACTTTGATTACAATAGATGTAGTGTTGTAATAAACTGTTTAATGGGG
--
>ENST00000547219.5_0:182_ENST00000456847.7_1297:1915
AAAACCCTTCCCTCCCCCGCTCCCCCGGAAGTGCTTTTCCAAGATTCGGGCCGGAGAGAGGCCTTGTAGGCACAGCGGCTGAGACTCGATCTGCTCCAAGTAGGGGCTCCAGCGCGGGTCGGAGTCTGGGGGTTCGCGCCCGCCGACCCGCGCCCTGCTCCCTCTCAGCACCTGGGCGGACGACGTGAACGATCAAGAGAGAGGGACCATAGTAGATCACGAGAAAAGAGTCGACGTCATAAATCCCGTAGTAGAGACCGTCATGACGATTATTACAGAGAGAGAAGCAGAGAACGAGAGAGGCACCGGGATCGTGACCGAGACCGTGACCGAGAGCGTGACCGAGAGCGCGAATATCGTCATCGTTAGAAGCTGAAGGAAGAGGATCACCTTCCAAGACAAAACAGTCTTCATGGGGGAAAAATGACGCTTGTCCAGCAGTTTGCTTCTTGTGATTGAACTGAACCTGTAAGGATTCATGGATAAAATGAACAGGAATAGATCTGAATAAAGCAAATCTGCATAAATGGTAACCAGTAGCTCTACTTTTATTTTTTATGTTGCTTAACTGTTTTATTTGAAGGAAACCTGTGTGATTTAAAAAGTTATAGCTTTTGCAACTTTATTACTGGTTATATACATTTGGCCATTATGATGTGCAAGCAATTGGAAAAAAAGTCAAGTAAATGCTTGTTTTTGTAGTAGTTTGTTCTTGTTAAAAATGTTTATATGATAATGTCTGTAAACAGCATCACTTTGATTACAATAGATGTAGTGTTGTAATAAACTGTTTAATGGGG
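The manual lookup above handles one sequence at a time. As a rough sketch, the same check can be run over every record by grouping headers by identical sequence. The function names `read_fasta` and `headers_by_sequence` are hypothetical helpers written for this illustration, not part of the tool, and the sketch assumes a plain FASTA file like output.fusions.fasta.

```python
from collections import defaultdict


def read_fasta(lines):
    """Yield (header, sequence) pairs; wrapped sequence lines are joined."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def headers_by_sequence(records):
    """Return {sequence: [headers]} for sequences seen under more than one header."""
    groups = defaultdict(list)
    for header, seq in records:
        groups[seq].append(header)
    return {seq: hs for seq, hs in groups.items() if len(hs) > 1}


if __name__ == "__main__":
    # The file name is taken from the commands above.
    with open("output.fusions.fasta") as handle:
        for seq, headers in headers_by_sequence(read_fasta(handle)).items():
            print(len(headers), ", ".join(headers))
```

Each printed line would give the number of headers sharing one sequence, followed by those headers.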
The sequence appears twice because ENST00000547219.5 can pair with either ENST00000266679.8 or ENST00000456847.7. However, ENST00000266679.8 and ENST00000456847.7 are transcripts of the same gene, so the two records seem biologically redundant. Would it make sense to reduce this redundancy by converting the ENST IDs to ENSG IDs and then keeping only unique header+sequence pairs before proceeding to the kallisto requant step?
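To make the proposal concrete, here is a minimal sketch of the gene-level collapse, not how the tool currently behaves. It makes several assumptions:

- Headers follow the four-field layout seen in the example (transcript, coordinates, transcript, coordinates, joined by underscores).
- A transcript-to-gene mapping is available as a dictionary. `tx2gene` and the `GENE-A`/`GENE-B` labels are placeholders, not real ENSG IDs.
- The coordinate fields are left out of the deduplication key, because in the example they differ between the two records (1611:2229 versus 1297:1915).

```python
def collapse_to_genes(records, tx2gene):
    """Keep one record per (5' gene, 3' gene, sequence) combination.

    records: iterable of (header, sequence) tuples, header without '>'.
    tx2gene: dict mapping transcript ID to gene ID (assumed to be supplied
             by the user, e.g. derived from the annotation used for the run).
    """
    seen = set()
    collapsed = []
    for header, seq in records:
        parts = header.split("_")
        if len(parts) != 4:
            # Unrecognised header layout: keep the record unchanged.
            collapsed.append((header, seq))
            continue
        tx5, _span5, tx3, _span3 = parts
        # Fall back to the transcript ID when no gene mapping is known.
        key = (tx2gene.get(tx5, tx5), tx2gene.get(tx3, tx3), seq)
        if key not in seen:
            seen.add(key)
            collapsed.append((f"{key[0]}_{key[1]}", seq))
    return collapsed


# Placeholder mapping; the gene labels are hypothetical.
tx2gene = {
    "ENST00000547219.5": "GENE-A",
    "ENST00000266679.8": "GENE-B",
    "ENST00000456847.7": "GENE-B",
}
records = [
    ("ENST00000547219.5_0:182_ENST00000266679.8_1611:2229", "ACGT"),
    ("ENST00000547219.5_0:182_ENST00000456847.7_1297:1915", "ACGT"),
]
print(collapse_to_genes(records, tx2gene))
```

With this toy input the two records collapse into one. Two different sequences for the same gene pair would still end up with the same header, so a suffix may be needed if the downstream indexing step requires unique record names.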