Hi everyone,

We are currently analyzing a very large long-read RNA dataset (ONT) and would like to use Flair in our transcript and isoform discovery pipeline. However, due to the sheer size of the dataset, even after partitioning the merged BED file by chromosome, the resulting files are still substantial, ranging from 40 GB (chr1) down to 3 GB (chr21).

To bring the BED file sizes and runtimes down to manageable levels for Flair, we decided to further partition each chromosome into smaller chunks. For example, we created 40 chunks for chr1, each with a BED file of about 1 GB, which allows us to parallelize the process and analyze the dataset with Flair in a reasonable amount of time.

We partitioned each BED file with the coreutils `split` command, ran `flair collapse` on each chunk, and then merged the resulting GTF files into a single file per chromosome. However, on a smaller test dataset, we observed additional transcripts around the chunk boundaries: transcripts that would normally collapse into one did not in our case, because they were processed separately.

Based on this, I have two questions:
I hope my questions are clear, and I am looking forward to your answers and a fruitful discussion around them!
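For reference, the line-based chunking described above (what `split` does) can be sketched in Python; `chunk_bed_lines` is a hypothetical helper, and the boundary artifacts arise precisely because these cut points ignore alignment coordinates:

```python
def chunk_bed_lines(lines, n_chunks):
    """Split a list of BED lines into n_chunks contiguous pieces,
    roughly equal in size -- equivalent in spirit to
    `split -n l/40 chr1.bed chunk_`. Cut points fall at arbitrary
    line boundaries, so overlapping reads can end up in different
    chunks and fail to collapse together."""
    # Ceiling division so every line lands in exactly one chunk.
    size = -(-len(lines) // n_chunks)
    return [lines[i:i + size] for i in range(0, len(lines), size)]
```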
This is a good approach. If you want to reduce weirdness around the boundaries, you could do a slightly more complex split of your BED files where you only split in regions with no read alignments. To really speed up collapse on each chunk, I recommend extracting the read names from each chunk, generating matched subsets of the FASTQ reads, and feeding those into collapse. Since collapse has realignment steps, if you don't do this, each collapse run on each chunk will realign all the reads.
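A minimal sketch of the "split only where there are no read alignments" idea, assuming the chunk's alignments are available as position-sorted `(start, end)` intervals on one chromosome (`gap_cut_points` is a hypothetical helper, not part of Flair):

```python
def gap_cut_points(intervals, min_gap=1):
    """Given position-sorted (start, end) read alignments on one
    chromosome, return positions inside alignment-free gaps.
    Cutting a BED file only at these positions guarantees no group
    of overlapping reads is split across chunks."""
    cuts = []
    cur_end = None
    for start, end in intervals:
        if cur_end is not None and start - cur_end >= min_gap:
            # Place the cut in the middle of the uncovered gap.
            cuts.append((cur_end + start) // 2)
        cur_end = end if cur_end is None else max(cur_end, end)
    return cuts
```

You would then pick the subset of these cut points that yields chunks of roughly the size you want (e.g. ~1 GB), rather than cutting at every gap.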
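The matched FASTQ subsets can be built by pulling read names (BED column 4) from each chunk and filtering the FASTQ with them. A sketch, assuming BED12 output from `flair align` where column 4 is the read name (both helpers are hypothetical; on disk, a tool such as `seqtk subseq reads.fq names.txt` achieves the same filtering):

```python
def chunk_read_names(bed_lines):
    """Collect read names (BED column 4) from one chunk's BED lines."""
    return {line.split('\t')[3] for line in bed_lines if line.strip()}

def subset_fastq(fastq_lines, names):
    """Keep only the 4-line FASTQ records whose read name is in `names`."""
    out = []
    for i in range(0, len(fastq_lines), 4):
        record = fastq_lines[i:i + 4]
        name = record[0].split()[0].lstrip('@')
        if name in names:
            out.extend(record)
    return out
```

Feeding each chunk's matched FASTQ into `flair collapse` means the realignment steps only see that chunk's reads instead of the full dataset.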