Hi everyone,

We are currently analyzing a very large long-read RNA dataset (ONT) and would like to use Flair in our transcript and isoform discovery pipeline. However, due to the sheer size of the dataset, even after partitioning the merged BED file by chromosome, the resulting files are still substantial, ranging from 40 GB (chr1) down to 3 GB (chr21).

To bring the BED file sizes and runtimes down to manageable levels for Flair, we decided to further partition each chromosome into smaller chunks. For example, we created 40 chunks for chr1, each with a BED file of about 1 GB, which allows us to parallelize the process and analyze the dataset with Flair in a reasonable amount of time.

We partitioned each BED file with the coreutils `split` command, ran `flair collapse` on each chunk, and then merged the resulting GTF files into a single file per chromosome. However, on a smaller test dataset, we observed additional transcripts around the chunk boundaries: transcripts that would normally collapse into one did not in our case, because they were processed separately.

Based on this, I have two questions:
I hope my questions are clear, and I am looking forward to your answers and a fruitful discussion around them!
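For reference, the line-based chunking described above (what `split` does) can be sketched in Python; `chunk_bed_lines` is a hypothetical helper, and the boundary artifacts arise precisely because these cut points ignore alignment coordinates:

```python
def chunk_bed_lines(lines, n_chunks):
    """Split a list of BED lines into n_chunks contiguous pieces,
    roughly equal in size -- equivalent in spirit to
    `split -n l/40 chr1.bed chunk_`. Cut points fall at arbitrary
    line boundaries, so overlapping reads can end up in different
    chunks and fail to collapse together."""
    # Ceiling division so every line lands in exactly one chunk.
    size = -(-len(lines) // n_chunks)
    return [lines[i:i + size] for i in range(0, len(lines), size)]
```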
This is a good approach. If you want to reduce weirdness around the boundaries, you could do a slightly more complex split of your BED files where you only split in regions with no read alignments. To really speed up collapse on each chunk, I recommend extracting the read names from each chunk, generating matched subsets of the FASTQ reads, and feeding those into collapse. Since collapse has realignment steps, if you don't do this, each collapse run on each chunk will realign all the reads.
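A minimal sketch of the "split only where there are no read alignments" idea, assuming the chunk's alignments are available as position-sorted `(start, end)` intervals on one chromosome (`gap_cut_points` is a hypothetical helper, not part of Flair):

```python
def gap_cut_points(intervals, min_gap=1):
    """Given position-sorted (start, end) read alignments on one
    chromosome, return positions inside alignment-free gaps.
    Cutting a BED file only at these positions guarantees no group
    of overlapping reads is split across chunks."""
    cuts = []
    cur_end = None
    for start, end in intervals:
        if cur_end is not None and start - cur_end >= min_gap:
            # Place the cut in the middle of the uncovered gap.
            cuts.append((cur_end + start) // 2)
        cur_end = end if cur_end is None else max(cur_end, end)
    return cuts
```

You would then pick the subset of these cut points that yields chunks of roughly the size you want (e.g. ~1 GB), rather than cutting at every gap.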
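The matched FASTQ subsets can be built by pulling read names (BED column 4) from each chunk and filtering the FASTQ with them. A sketch, assuming BED12 output from `flair align` where column 4 is the read name (both helpers are hypothetical; on disk, a tool such as `seqtk subseq reads.fq names.txt` achieves the same filtering):

```python
def chunk_read_names(bed_lines):
    """Collect read names (BED column 4) from one chunk's BED lines."""
    return {line.split('\t')[3] for line in bed_lines if line.strip()}

def subset_fastq(fastq_lines, names):
    """Keep only the 4-line FASTQ records whose read name is in `names`."""
    out = []
    for i in range(0, len(fastq_lines), 4):
        record = fastq_lines[i:i + 4]
        name = record[0].split()[0].lstrip('@')
        if name in names:
            out.extend(record)
    return out
```

Feeding each chunk's matched FASTQ into `flair collapse` means the realignment steps only see that chunk's reads instead of the full dataset.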