fg-sra

High-performance SRA-to-SAM/BAM converter, replacing NCBI's sam-dump with multi-threaded processing for significantly higher throughput.

Fulcrum Genomics

Visit us at Fulcrum Genomics to learn more about how we can power your Bioinformatics with fg-sra and beyond.

Overview

fg-sra converts NCBI SRA archives to SAM or BAM format. It uses FFI bindings to the NCBI VDB C library (libncbi-vdb) for reading SRA data and processes references in parallel for high throughput.

Key features:

  • Multi-threaded reference processing with ordered output
  • SAM and BAM output (BAM via multi-threaded BGZF compression)
  • gzip/bzip2 compression for SAM output
  • FASTA/FASTQ output modes
  • Region filtering by genomic coordinates
  • Quality quantization
  • Reference cache warming via cache-refs to avoid resolver failures under load
  • Mate cache for proper SAM flag and mate-pair information
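As an illustration of the first feature, multi-threaded processing with ordered output can be sketched with the Rust standard library alone. This is not fg-sra's actual implementation: `process_reference` and `process_in_order` are hypothetical names, and the striped work assignment is an assumption made for brevity. Workers tag each result with its input index, and the receiving side holds back early arrivals until every preceding result has been emitted.

```rust
use std::collections::BTreeMap;
use std::sync::mpsc;
use std::thread;

/// Hypothetical stand-in for converting one reference's alignments.
fn process_reference(name: &str) -> String {
    format!("records for {name}")
}

/// Process `refs` on `n_threads` workers and return the results in input order.
fn process_in_order(refs: &[&str], n_threads: usize) -> Vec<String> {
    let n_threads = n_threads.max(1);
    let (tx, rx) = mpsc::channel::<(usize, String)>();
    let mut ordered = Vec::with_capacity(refs.len());

    thread::scope(|scope| {
        for worker in 0..n_threads {
            let tx = tx.clone();
            scope.spawn(move || {
                // Static striping: worker k handles items k, k + n, k + 2n, ...
                for (idx, name) in refs.iter().enumerate().skip(worker).step_by(n_threads) {
                    tx.send((idx, process_reference(name))).expect("receiver is alive");
                }
            });
        }
        // Drop the original sender so the channel closes when the workers finish.
        drop(tx);

        // Reorder buffer: results may arrive in any order, but are emitted by index.
        let mut pending = BTreeMap::new();
        let mut next = 0usize;
        for (idx, out) in rx {
            pending.insert(idx, out);
            while let Some(out) = pending.remove(&next) {
                ordered.push(out);
                next += 1;
            }
        }
    });
    ordered
}
```

Calling `process_in_order(&["chr1", "chr2", "chr3"], 2)` returns the three results in the order the references were given, regardless of which worker finishes first. A real converter would likely stream records to the writer rather than collect them in a `Vec`.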

The following sam-dump options are accepted on the command line but are not yet implemented:

  • --hide-identical: output "=" for bases that match the reference
  • --with-md-flag: compute and output the MD tag
  • --rna-splicing / --rna-splice-level / --rna-splice-log: RNA splice detection

These options require reference sequence access through the VDB FFI, which has not yet been implemented.

Installation

Building from source (vendored)

git clone --recurse-submodules https://github.com/fg-labs/fg-sra
cd fg-sra
cargo build --release

Prerequisites

  • Rust (stable toolchain)
  • CMake (for building the vendored ncbi-vdb C library)

The vendored feature is enabled by default, building ncbi-vdb from the git submodule automatically during cargo build.

Building with a pre-built VDB

To link against a system-installed ncbi-vdb instead of building from source:

export VDB_INCDIR=/path/to/ncbi-vdb/interfaces
export VDB_LIBDIR=/path/to/lib/containing/libncbi-vdb.a
cargo build --release --no-default-features

Usage

# Convert an SRA accession to SAM
fg-sra tosam SRR390728

# Convert to BAM
fg-sra tosam --output-format bam --output-file output.bam SRR390728

# Primary alignments only, with unaligned reads
fg-sra tosam -1 -u SRR390728

# Filter by region
fg-sra tosam --aligned-region chr1:1000000-2000000 SRR390728

# Multi-threaded with 8 threads
fg-sra tosam -t 8 SRR390728

For full usage, run:

fg-sra tosam --help
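The region string passed to --aligned-region in the example above has the form name:start-end. The sketch below shows one way such a string could be parsed in Rust. It is an illustration only: `Region` and `parse_region` are hypothetical names, not part of fg-sra, and the README does not state which other region forms (if any) the tool accepts or how coordinates are interpreted.

```rust
/// A genomic interval as written on the command line, e.g. chr1:1000000-2000000.
#[derive(Debug, PartialEq, Eq)]
struct Region {
    name: String,
    // Coordinates are kept exactly as written; this sketch does not assume
    // any particular coordinate convention.
    start: u64,
    end: u64,
}

/// Parse a `name:start-end` string into a `Region`.
fn parse_region(s: &str) -> Result<Region, String> {
    // Split on the last ':' so that reference names containing ':' still parse.
    let (name, range) = s
        .rsplit_once(':')
        .ok_or_else(|| format!("missing ':' in {s:?}"))?;
    let (start, end) = range
        .split_once('-')
        .ok_or_else(|| format!("missing '-' in {s:?}"))?;
    let start = start
        .parse::<u64>()
        .map_err(|e| format!("bad start in {s:?}: {e}"))?;
    let end = end
        .parse::<u64>()
        .map_err(|e| format!("bad end in {s:?}: {e}"))?;
    if name.is_empty() {
        return Err(format!("empty reference name in {s:?}"));
    }
    if start > end {
        return Err(format!("start is greater than end in {s:?}"));
    }
    Ok(Region { name: name.to_string(), start, end })
}
```

Validating the string up front gives a clear error message before any SRA data is opened, which is cheaper than failing partway through a conversion.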

Pre-caching References

When running many fg-sra tosam conversions concurrently, the VDB reference resolver can fail under heavy load. Use cache-refs to serially pre-populate the local reference cache before launching concurrent conversions:

# Cache references for a list of accessions
fg-sra cache-refs SRR622461 SRR765989 SRR341578

# Then run conversions concurrently
parallel fg-sra tosam {} ::: SRR622461 SRR765989 SRR341578

The cache-refs command processes accessions sequentially, resolving all reference sequence dependencies via VDB and caching them locally. Subsequent tosam runs will find these references in the local cache without network access.
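The same warm-then-convert sequence can be driven from a program instead of a shell. The following is a minimal sketch using `std::process::Command`, assuming `fg-sra` is on the PATH. The function names and the `<accession>.bam` output naming are hypothetical choices; only the subcommands and flags shown elsewhere in this README are used.

```rust
use std::io;
use std::process::{Child, Command};

/// Arguments for the single, serial cache-warming call.
fn cache_refs_args(accessions: &[&str]) -> Vec<String> {
    let mut args = vec!["cache-refs".to_string()];
    args.extend(accessions.iter().map(|a| a.to_string()));
    args
}

/// Arguments for one BAM conversion (output file name is a hypothetical choice).
fn tosam_args(accession: &str) -> Vec<String> {
    vec![
        "tosam".to_string(),
        "--output-format".to_string(),
        "bam".to_string(),
        "--output-file".to_string(),
        format!("{accession}.bam"),
        accession.to_string(),
    ]
}

/// Warm the reference cache once, then run all conversions concurrently.
fn warm_then_convert(accessions: &[&str]) -> io::Result<()> {
    // Step 1: serial cache warming; stop if it fails.
    let status = Command::new("fg-sra").args(cache_refs_args(accessions)).status()?;
    if !status.success() {
        return Err(io::Error::new(io::ErrorKind::Other, "cache-refs failed"));
    }

    // Step 2: start one process per accession without waiting in between.
    let children = accessions
        .iter()
        .map(|acc| Command::new("fg-sra").args(tosam_args(acc)).spawn())
        .collect::<io::Result<Vec<Child>>>()?;

    // Step 3: wait for every conversion and report the first failure.
    for mut child in children {
        if !child.wait()?.success() {
            return Err(io::Error::new(io::ErrorKind::Other, "tosam failed"));
        }
    }
    Ok(())
}
```

This sketch starts every conversion at once. For long accession lists, a bounded worker pool (or a tool such as parallel, as above) would keep the number of simultaneous processes in check.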

Performance

Converting SRR20022182 to coordinate-sorted BAM (with output piped through samtools sort) completes in about 5 s of wall-clock time with about 400 MB of peak memory. Use --threads to enable multi-threaded processing of aligned reads.

Workspace Structure

fg-sra/
├── crates/
│   ├── fg-sra-vdb-sys/    # Raw FFI bindings to libncbi-vdb
│   ├── fg-sra-vdb/        # Safe Rust wrappers over VDB
│   └── fg-sra/            # Binary crate (the converter)
└── vendor/
    └── ncbi-vdb/           # Vendored VDB library (git submodule)

Sponsors

Development of fg-sra is supported by Fulcrum Genomics.

Become a sponsor

Disclaimer

This software is under active development. While we make a best effort to test this software and to fix issues as they are reported, this software is provided as-is without any warranty (see the license for details). Please submit an issue, and better yet a pull request as well, if you discover a bug or identify a missing feature. Please contact Fulcrum Genomics if you are considering using this software or are interested in sponsoring its development.
