
How-Tos

Convert between chromosome names

Conversion tables

Format for mouse and human genomes

  • NCBI GenBank: accessions
    • Human: chromosomes 1-22, X, and Y = CM000[663-686].[version]
    • Mouse: chromosomes 1-19, X, and Y = CM00[0994-1014].[version]
    • Accessions for mitochondrial chromosomes and non-reference chromosomes/scaffolds (alternate loci, unlocalized scaffolds, unplaced scaffolds, fix patches, and novel patches) do not follow a linear ordering.
  • NCBI RefSeq: accessions
    • Human: chromosomes 1-22, X, and Y = NC_0000[01-24].[version]
    • Mouse: chromosomes 1-19, X, and Y = NC_0000[67-87].[version]
    • Accessions for mitochondrial chromosomes and non-reference chromosomes/scaffolds do not follow a linear ordering.
  • Ensembl
    • Reference chromosomes: 1-22 (human) or 1-19 (mouse) + X + Y + MT
    • Alternate loci, novel patches, fix patches: assembly name
    • Unlocalized scaffolds and unplaced scaffolds: GenBank accession
  • UCSC
    • reference chromosomes: chr[#|X|Y|M]
    • unlocalized scaffolds, alternate loci scaffolds, fix loci scaffolds: chr[#|X|Y|M]_[GenBank accession]v[GenBank version]_[random|alt|fix]
    • unplaced scaffolds: chrUn_[GenBank accession]v[GenBank version]
  • (human and mouse only) GENCODE
    • reference chromosomes: UCSC-style chr[#|X|Y|M]
    • non-reference chromosomes/scaffolds: GenBank accession.version
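
As a worked example of converting names, the UCSC chromAlias table (available under the database/ directory described in the UCSC section below) maps every alias to the corresponding UCSC sequence name. A minimal sketch, assuming the table's columns are (alias, UCSC name, source) and a hypothetical input file regions.bed that uses UCSC names:

    # Download the hg38 chromAlias table (columns assumed: alias, UCSC chrom, source)
    wget -q https://hgdownload.soe.ucsc.edu/goldenPath/hg38/database/chromAlias.txt.gz

    # Build a UCSC -> RefSeq lookup from the alias table, then relabel the first
    # column of regions.bed (a placeholder BED file with UCSC-style names)
    zcat chromAlias.txt.gz \
      | awk 'BEGIN{FS=OFS="\t"} NR==FNR { if ($3 ~ /refseq/) map[$2] = $1; next }
             ($1 in map) { $1 = map[$1] } 1' - regions.bed > regions.refseq.bed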

GRCh38.p14 notes (see also Google Colab notebook)

  • GenBank assembly accessions start with "GCA", while RefSeq assembly accessions start with "GCF".
  • History: the initial GRCh38 release in 2013 (GCA_000001405.15, GCF_000001405.26) contained 455 sequences (25 reference chromosomes + 127 unplaced scaffolds + 42 unlocalized scaffolds + 261 alternate loci). By patch 14 (GCA_000001405.29, GCF_000001405.40), 4 GenBank accessions (described below) had been dropped from the RefSeq assembly and 164 fix patches and 90 novel patches had been added, for a total of 709 GenBank accessions and 705 RefSeq accessions.
    • KI270752.1 (unplaced scaffold): dropped in patch 13 from the RefSeq assembly "because it is hamster sequence derived from the human-hamster CHO cell line" [UCSC hg38 bigZip]
      • This sequence is still kept in the NCBI GenBank assembly. "Removal of this sequence from the GenBank assembly can only be done at the time of a new major assembly release." [GRC Issue HG-2587]
    • KI270825.1 (alternate locus), KI270721.1 (unlocalized scaffold), KI270734.1 (unlocalized scaffold): "contamination or obsolete" sequences dropped in patch 14 from the RefSeq assembly [UCSC hg38 bigZip]
  • Comparison of other assemblies to NCBI GenBank
    • NCBI RefSeq: excludes the 4 sequences above
    • Ensembl release 113: contains the 4 sequences above but excludes 3 fix patches MU273354.1, KN538374.1, and MU273386.1
    • UCSC: excludes the 4 sequences above, but contains 2 extra sequences KQ759759.1 and KQ759762.1
      • KQ759759 and KQ759762 (fix patches) were updated from version 1 to version 2 in patch 14
      • "Because of the difficulty of removing the old chroms chr11_KQ759759v1_fix and chr22_KQ759762v1_fix from all of the database tables and bigData files, custom tracks, and hubs, we are not dropping them from the UCSC hg38 patch 14 .2bit and chromInfo. However, we have dropped them from chromAlias to accord with the Genbank and Refseq official releases for patch14." [UCSC hg38 bigZip]
    • GENCODE: many non-reference sequences have no annotations
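
One way to reproduce these per-sequence comparisons is to start from the GenBank assembly report, which lists every sequence with its role, GenBank accession, RefSeq accession, and UCSC-style name. A sketch (the FTP path below follows NCBI's standard per-accession directory layout and should be verified before use):

    # Fetch the GRCh38.p14 assembly report from the NCBI genomes FTP site
    wget -q https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/001/405/GCA_000001405.29_GRCh38.p14/GCA_000001405.29_GRCh38.p14_assembly_report.txt

    # Tally sequences by role (assembled-molecule, alt-scaffold, fix-patch, novel-patch, ...)
    grep -v '^#' GCA_000001405.29_GRCh38.p14_assembly_report.txt | cut -f2 | sort | uniq -c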

Which genome assembly to use for alignment

References

General guidelines

  • Unless the aligner is "ALT-aware" and can appropriately use alternate loci sequences, do not include the alternate loci sequences in alignment indices.
  • Include the unplaced and unlocalized scaffolds
    • This will prevent false alignment of reads from those genomic regions to the reference chromosomes.
  • Hard-mask duplicate regions
    • Example (human genome): the two PAR regions on chromosome Y, and duplicate copies of centromeric arrays and WGS on chromosomes 5, 14, 19, 21 & 22
  • An Epstein-Barr virus (EBV) sequence is often included "as a sink for alignment of reads that are often present in sequencing samples."

FASTA sequences and indices following these guidelines are termed "analysis sets".
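
For example, the GRCh38 "no-alt" analysis set distributed by NCBI follows these guidelines (alternate loci excluded, PARs and duplicated regions hard-masked, EBV included). A sketch of retrieving it, assuming the path has not changed since writing:

    # GRCh38 no-alt analysis set (UCSC-style sequence names); verify the path on the NCBI FTP site
    wget -q https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/001/405/GCA_000001405.15_GRCh38/seqs_for_alignment_pipelines.ucsc_ids/GCA_000001405.15_GRCh38_no_alt_analysis_set.fna.gz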

Choosing blacklist regions

References

Below, I've compiled blacklists from ENCODE and associated labs (Anshul Kundaje and the Boyle lab). For a more comprehensive and updated table of blacklists, see the excluderanges package.

| Genome | Blacklist version | Links |
| --- | --- | --- |
| hg19 | v1 | GitHub, ENCODE* |
| hg19 | v2 | GitHub |
| hg38 | v1 | Kundaje, GitHub |
| hg38 | v2 | GitHub |
| hg38 | v3 | ENCODE |
| mm10 | v1 | ENCODE, Kundaje, GitHub |
| mm10 | v2 | GitHub |

* Confusingly, Anshul Kundaje's webpage lists the hg19 annotation file ENCFF001TDO as both Version 1 and Version 3. The file is identical to Version 1 of the hg19 blacklist on the Boyle Lab GitHub.

How the Blacklist program works

As of commit 61a04d2.

Blacklists are computed for one reference chromosome at a time. Consider a chromosome of length $N$ base pairs that we divide into bins of size 1000 bp (binSize). Each bin overlaps the next bin by 900 bp (binSize - binOverlap). For example, bin 0 represents the interval [0, 1000), bin 1 represents [100, 1100), etc. The number of total bins is $B = \mathrm{floor}\left(\frac{N - \mathrm{binSize}}{\mathrm{binOverlap}} + 1\right)$: the last $N \mathbin{%} \mathrm{binOverlap}$ base pairs of the chromosome are not included in a bin.
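
As a worked example, a chromosome of length $N = 10{,}567$ bp yields $B = \mathrm{floor}\left(\frac{10567 - 1000}{100} + 1\right) = 96$ bins; the last bin covers [9500, 10500), so the final $10567 \mathbin{%} 100 = 67$ bp fall outside every bin.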

Working definitions

  • A position $i$ is considered "uniquely mappable" for read length $k$ if the sequence from position $i$ to $i + k$ does not appear elsewhere in the genome.
  • A read (or sequence) is considered "uniquely mappable" if the position $i$ to which the read aligns is uniquely mappable for the read's length; otherwise it is considered "multimapping."
    • A "multimapping" read does not mean that the aligner found (nearly) equally good alignments for the read at multiple different genomic loci, but rather that the position that the read aligns to is not considered "uniquely mappable."

Parameters

  • binSize = 1000
  • binOverlap = 100
  • uniqueLength = 36: the presumed read length

Input

  • Mappability vector for the chromosome generated by Umap (typically run with k-mer sizes 24, 36, 50, and 100): $X \in \{0, \ldots, 255\}^N$
    • If $X_i = 0$, then position $i$ was not uniquely mappable for any of the k-mer lengths tested.
    • If $X_i > 0$, then $X_i$ is the length of the shortest sequence (out of the k-mer lengths tested) starting at position $i$ that is uniquely mappable.
  • $L$ BAM files of input libraries from different experiments
    • The developer suggests using $L \geq 50$ input tracks. [GitHub issue 6]
    • Based on the lists of input tracks provided, these appear to be unfiltered alignment files and include reads that did not align.

Output: BED file containing regions marked as "Low Mappability" or "High Signal Region"

  • Such regions contain at least one bin at or above the 99.9th percentile (for the corresponding region type: either the count of uniquely mappable reads or of multimapping reads), along with contiguously neighboring bins at or above the 99th percentile (or bins with zero signal); regions that are < 20 kbp apart are merged.
  • "High Signal Region": contains at least 1 bin with abnormally high count of uniquely mappable reads
  • "Low Mappability": contains 1 or more bins with abnormally high count of "multimapping" reads, and no bins with abnormally high count of uniquely mappable reads

Key variables

  • binsMap (type = integer, length = $B$): number of uniquely mappable positions within each bin

  • inputData (type = SequenceData, length = $L$): stores the # of uniquely mappable or multimapping reads in each bin for each input library

    • binsInput (type = integer, length = $B$): # of reads in each bin that are uniquely mappable
      • There is a minor bug in the code such that the length of binsInput may be 1 less than $B$ when $N - \mathrm{binSize}$ is exactly divisible by binOverlap. Line 112 in blacklist.cpp

        for(int i = 0; i < tempCounts.size() - binSize; i+=binOverlap) {

        should be

        for(int i = 0; i <= tempCounts.size() - binSize; i+=binOverlap) {

        This "bug" is accounted for in the main() function (lines 412-414) by dropping the last element/bin of binsMap if binsMap is longer than binsInput. This means that up to the last $(N \mathbin{%} \mathrm{binOverlap}) + \mathrm{binOverlap}$ base pairs are left out of the blacklist analysis.

    • binsMultimapping (type = integer, length = $B$): # of reads in each bin that are not uniquely mappable
    • totalReads (type = integer): # of reads mapping to the chromosome being analyzed
  • readNormList (type = double, length = $B$): median (across the $L$ input libraries) of the quantile-normalized number of uniquely mappable reads per uniquely mappable position for each bin

    $$\underset{\text{input libraries}}{\mathrm{median}} \left(\underset{\text{input libraries}}{\text{quantile-norm}} \left( U \right)\right)$$

    where $U_{b,l} = \frac{\mathrm{inputData[l].binsInput[b]}}{\mathrm{binsMap[b]}}$ is the number of uniquely mappable reads in input library $l$ divided by the number of uniquely mappable positions in bin $b$.

    • Quantile normalization makes the distribution of read counts identical across all input libraries (see also Wikipedia)
      • Procedure implemented in the Blacklist program: sort binsInput for each input library; set the $x$th-ranked value in each input library to the mean of the $x$th-ranked values across all input libraries (see the worked example after this list).
  • multiList (type = double, length = $B$): median (across the $L$ input libraries) of the quantile-normalized number of multimapping reads per million total reads for each bin

    $$\underset{\text{input libraries}}{\mathrm{median}} \left(\underset{\text{input libraries}}{\text{quantile-norm}} \left( M \right)\right)$$

    where $M_{b,l} = \frac{\mathrm{inputData[l].binsMultimapping[b]}}{\mathrm{inputData[l].totalReads}} \cdot 1000000$ is the number of multimapping reads in bin $b$ per million total reads in input library $l$.

    • Quantile normalization performed analogously as with readNormList but using the multimapping read distribution (binsMultimapping) from each input library instead of the uniquely mappable read distribution (binsInput).
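
As a worked example of the quantile normalization used for readNormList and multiList (with made-up numbers): suppose $B = 3$ bins and $L = 3$ input libraries with per-bin values $(5, 2, 8)$, $(4, 7, 1)$, and $(3, 6, 9)$. Sorting each library gives ranked values $(2, 5, 8)$, $(1, 4, 7)$, and $(3, 6, 9)$, so the rank means are $(2, 5, 8)$. Replacing each value by the mean for its rank yields $(5, 2, 8)$, $(5, 8, 2)$, and $(2, 5, 8)$, and the per-bin median across libraries is $(5, 5, 8)$.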

Genome annotations

Human genome

Repeat Masking

Questions

  • What is the precise relationship between RepeatMasker and RepBase?

References and resources

UCSC

Download site: https://hgdownload.soe.ucsc.edu/downloads.html

UCSC Servers (US server URLs provided; see UCSC Genome Browser's documentation for URLs for servers in Europe and elsewhere)

  • HTTP server: hgdownload.soe.ucsc.edu/
  • FTP server: ftp://hgdownload.soe.ucsc.edu
  • MariaDB (MySQL) server: genome-mysql.soe.ucsc.edu
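
For example, the public MariaDB server can be queried directly with the read-only genome user (no password), per the UCSC Genome Browser documentation; a minimal sketch listing a few rows of the hg38 chromAlias table:

    # Query UCSC's public MariaDB (MySQL) server for the hg38 chromAlias table
    mysql --user=genome --host=genome-mysql.soe.ucsc.edu -A -D hg38 \
      -e 'SELECT * FROM chromAlias LIMIT 5;'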

Organization of directories accessible on HTTP and FTP servers

  • goldenPath (https://hgdownload.soe.ucsc.edu/goldenPath/<genome> or ftp://hgdownload.soe.ucsc.edu/goldenPath/<genome>): UCSC genome annotations [UCSC Genome Browser User Guide]
    • Files in this directory are largely descriptive (showing where things are along the genome) rather than numeric.
      • Exceptions: conservation scores (under the phyloP[#]way and phastCons[#]way directories)
    • bigZips/: genome sequence, selected annotation files and updates

      "Files in this directory reflect the initial... release of the genome, the most current versions are in the "latest/" subdirectory"

      • RepeatMasker-masked genome FASTA files
      • <genome>.chrom.sizes
      • <genome>.chromAlias.txt
      • (Updated regularly) RefSeq mRNA multi-FASTA files
      • (Updated regularly) upstream1000/2000/5000: sequences 1000/2000/5000 bases upstream of annotated TSSs of RefSeq genes with annotated 5' UTRs.
    • chromosomes/: a FASTA file for each chromosome/scaffold from the initial genome assembly release (i.e., without any patches)
    • database/: annotation tables, where each table is represented by a .sql file containing the SQL commands used to create the table and a .txt.gz file of the table data in tab-delimited format. Schema descriptions can be found by selecting the relevant dataset and clicking the "Data format description" button in the Table Browser. See the download example after this list.
      • 1 table from the RepeatMasker track: rmsk
      • 6 tables from the RepeatMasker Viz. track: rmsk[Align|Out|Joined][Baseline|Current]
        • rmskAlign* and rmskOut*: very similar, except that Align corresponds to the ".align" file generated by RepeatMasker that shows the alignments between the repeat and query sequence, which is missing from the ".out" file.
        • rmsk*Baseline vs. rmsk*Current: based on the GitHub repo, appear to correspond to older vs. newer annotations
        • rmskJoined*: unclear
      • chromAlias.txt.gz
      • ... many others ...
  • gbdb (https://hgdownload.soe.ucsc.edu/gbdb/ or ftp://hgdownload.soe.ucsc.edu/gbdb/): bigBeds/bigWigs/BAMs and other binary files [UCSC Genome Browser Blog]
    • This includes essentially all functional genomics data, such as expression (e.g., RNA-seq data from the FANTOM and GTEx projects), transcription factor binding (e.g., ChIP-seq data from ENCODE and ReMap), and chromatin accessibility (e.g., DNase HS signal from ENCODE).
    • Integration of 3rd-party data: TCGA, variant effect prediction scores (e.g., from CADD), aberrant splicing scores (e.g., from AbSplice)
    • Other numeric tracks: GC content
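
As a download example for the database/ directory described above, each annotation table is fetched as a paired .sql schema file and .txt.gz data file; a sketch for the hg38 rmsk (RepeatMasker) table:

    # RepeatMasker table for hg38: SQL schema plus tab-delimited data
    wget -q https://hgdownload.soe.ucsc.edu/goldenPath/hg38/database/rmsk.sql
    wget -q https://hgdownload.soe.ucsc.edu/goldenPath/hg38/database/rmsk.txt.gz

    # Inspect the column definitions and the first data row
    head rmsk.sql
    zcat rmsk.txt.gz | head -n 1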

Provenance of files: https://github.com/ucscGenomeBrowser/kent/tree/master/src/hg/makeDb/doc

NCI Genomic Data Commons

Abbreviations

  • NCI: National Cancer Institute
  • GDC: Genomic Data Commons

Documentation: https://docs.gdc.cancer.gov/

Searching and filtering data

Available fields: https://docs.gdc.cancer.gov/API/Users_Guide/Appendix_A_Available_Fields/

  • Definitions of each field can be found by searching the last component of the field name (i.e., after the last period in the field name) in the GDC Data Dictionary.
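
As an illustration of querying via the GDC API (a sketch; the field names come from the Available Fields list above, and the size/format parameters follow the GDC API documentation):

    # Request two file records from the GDC files endpoint, returning selected fields as TSV
    curl -s 'https://api.gdc.cancer.gov/files?fields=file_name,cases.samples.submitter_id&size=2&format=tsv'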

TCGA Barcode

Abbreviations

  • TSS: Tissue Source Site
  • BCR: Biospecimen Core Resource

Format: [project]-[TSS]-[participant]-[sample][vial]-[portion][analyte]-[plate]-[center]

When retrieving data via the GDC API, the TCGA Barcode can be found (if applicable) under the cases.samples.submitter_id.
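
A minimal sketch of splitting a barcode into these components (the barcode below is hypothetical and used only for illustration):

    # Split a hypothetical TCGA barcode on "-" into its components
    barcode="TCGA-02-0001-01C-01D-0182-01"
    IFS=- read -r project tss participant sample_vial portion_analyte plate center <<< "$barcode"
    echo "project=$project TSS=$tss participant=$participant sample+vial=$sample_vial" \
         "portion+analyte=$portion_analyte plate=$plate center=$center"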

Reference

  1. GDC Documentation: https://docs.gdc.cancer.gov/Encyclopedia/pages/TCGA_Barcode/
  2. TCGA Code Tables: https://gdc.cancer.gov/resources-tcga-users/tcga-code-tables
  3. Wikipedia: https://en.wikipedia.org/wiki/The_Cancer_Genome_Atlas

NCBI Entrez and E-Utilities

E-Direct command line tutorials
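
For instance, EDirect can pull run-level SRA metadata for a BioProject in one pipeline (a sketch, assuming EDirect is installed; PRJNA63443 is the BioProject used in the example further below):

    # Retrieve SRA run metadata (runinfo CSV) for all runs under a BioProject
    esearch -db sra -query 'PRJNA63443[BioProject]' | efetch -format runinfo > runinfo.csv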

Other references

NCBI Genomics Databases

Relationship between GEO, SRA, BioSample, and BioProject accessions

| NCBI Database | Study / Project | Biological Sample | Experiment / Technical Replicate | Run |
| --- | --- | --- | --- | --- |
| BioProject / BioSample | BioProject (PRJNA) | BioSample (SAMN) | | |
| Sequencing Read Archive | SRA Study (SRP) | SRA Sample (SRS) | SRA Experiment (SRX) | SRA Run (SRR) |
| Gene Expression Omnibus Database | GEO Series (GSE) | | GEO Sample (GSM) | |
  • Columns: accession types in the same column usually map 1:1, except in the Study / Project column:
    • "The BioProject database defines two types of projects: 1) primary submission projects... are directly associated with submitted data... 2) umbrella projects, which reflect a higher-level organizational structure for larger initiatives or provide an additional level of data tracking." [NCBI Handbook] Therefore, one BioProject can "contain" another BioProject.
      • For example, ENCODE (which also has its own dedicated GEO listing page) has multiple layers of umbrella BioProjects: The human ENCODE (ENCyclopedia Of DNA Elements) project (PRJNA30707)
        • Pilot projects for the human ENCODE project (PRJNA13681)
          • 14 sub-projects ...
        • Production projects for the human ENCODE project (PRJNA63441)
    • Each BioProject and each SRA Study may be associated with multiple GEO Series.
      • Clearly, an umbrella BioProject may be associated with multiple SRA Studies (i.e., 1:many mapping). However, can a primary submission BioProject be associated with multiple SRA Studies?
      • A single SRA Study can be associated with many GEO Series. For example, the ENCODE SRA Study SRP012412 is associated with GEO Series GSE177866 and GSE231300, among many others.
  • Rows: The relationship is usually 1:many going across a row, left-to-right. However, many:many relationships exist:
    • As described in the SRA section below, a single SRA Sample (SRS) can be associated with multiple SRA Studies (SRP).
    • Similarly, multiple BioProjects can be linked to a single BioSample. [source]
  • References

Example

Consider the ENCODE Mint-ChIP-seq experiment ENCSR928PSU, which generated 2 technical replicates. Each technical replicate is associated with many read (FASTQ) files, which are not all the same read length. Below, I show the accessions associated with technical replicate 2.

| NCBI Database | Study / Project | Biological Sample | Experiment / Technical Replicate | Run / Library |
| --- | --- | --- | --- | --- |
| BioProject / BioSample | PRJNA63443 | SAMN19597277 | | |
| Sequencing Read Archive | SRP012412 | SRS9223741 | SRX11165854 | 8 runs: SRR14842522, ..., SRR14842529 |
| Gene Expression Omnibus Database | GSE177866 | | GSM5379091 | |
| ENCODE | | ENCBS832HZD | ENCLB042NQH | (ENCFF089NQO (R1), ENCFF521RSO (R2)), ..., (ENCFF785QUX (R1), ENCFF204YZK (R2)) |

Notes

  • Some accessions do not have a useful dedicated page/interface and are therefore not hyperlinked above.
    • Example: Searching for a SRA Sample accession simply returns a list of associated SRA Experiments (e.g., SRS9223741).
    • Example: ENCODE pages for libraries from technical replicates (e.g., ENCLB042NQH) are merely JSON dumps; instead, technical replicates are best viewed from the experiment page (e.g., ENCSR928PSU).
  • The same SRA sample accession SRS9223741 is associated with 2 SRA experiments: SRX11165854 (technical replicate 2: see GSM5379091) and SRX11165855 (technical replicate 1: see GSM5379092). This demonstrates how the term "sample" is used differently by SRA / BioSample (referring to a biological sample) vs. GEO (referring to a technical replicate or library).
  • The concept of an ENCODE experiment accession (which may encompass multiple biological and technical replicates) does not appear to neatly correspond to any NCBI accession.

Gene Expression Omnibus (GEO)

NCBI's All Resources page splits GEO into 3 components:

  • GEO Database: main data repository, with GEO Samples organized into GEO Series.
  • GEO DataSets: "a curated collection of biologically and statistically comparable GEO Samples" [GEO Overview]; accession prefix = GDS
  • GEO Profiles: derived from GEO DataSets

Each of the 3 components has its own interface, but the GEO Database interface appears more limited and, as of 2025-12-03, is not selectable from the database/resource dropdown next to the search bar at the top of most NCBI pages. Instead, both Series and DataSets are searchable using the GEO DataSets interface.

NCBI Sequencing Read Archive (SRA)

Accession prefixes (see https://www.ncbi.nlm.nih.gov/sra/docs/submitmeta/)

  • STUDY: SRP#
  • SAMPLE (SRS#): can be shared between STUDYs and between EXPERIMENTs.
  • EXPERIMENT (SRX#): main publishable unit in the SRA database
    • Each EXPERIMENT represents a combination of biological replicate, library, sequencing strategy (e.g., targeted selection vs. unbiased), layout (e.g., paired end vs. single end), and instrument model.
  • RUN (SRR#): a "RUN is simply a manifest of data file(s) that are derived from sequencing a library described by the associated EXPERIMENT."
    • "All data files listed in a RUN will be merged into a single *.sra* archive file."
  • SUBMISSION (SRA#; non-public accession)

SRA data formats

  • References
  • SRA Normalized Format (*.sra; aka extract-transform-load or ETL format): contains base calls, full base quality scores, and alignments
    • Discards original read names [SRA Toolkit GitHub]
    • Storage: AWS (hot) via AWS Open Data Program --> free egress worldwide with anonymous identity
  • SRA Lite (*.sralite; aka ETL-BQS for ETL format without base quality scores): contains base calls, per-read quality flag, and alignments
    • Discards original read names [SRA Toolkit GitHub]
    • The per-read quality flag (Read_Filter) is either pass or reject. See SRA Documentation for how the SRA determines whether a read passes the read filter.
    • Storage
      • NCBI servers: free egress worldwide with anonymous identity
      • Cloud (AWS and GCP): hot; free egress to cloud services in the same geographical region
  • Originally submitted source files
    • Storage: AWS, mostly cold storage

Accessing SRA data

  • Web interface
    • Run Browser: Search by SRA Run Accession (SRR#) to see metadata, taxonomy analysis, read sequences, and data access information about the run, as well as a tool to download FASTA/FASTQ files for runs in the same SRA Experiment.
      • The Data access tab indicates where the data is stored (NCBI, AWS, or GCP servers) and what types of egress are free.
      • The FASTA/FASTQ download web interface only allows a limited download of <5 Gb of sequence over HTTP. [SRA Documentation]
    • Run Selector: Search by SRA, BioProject, BioSample, or GEO accessions to see all associated SRA Runs. Offers an interface to download metadata or retrieve the data from cold cloud storage to a cloud bucket.
  • AWS
    • Buckets available through the Registry of Open Data (s3://sra-pub-src-1/, s3://sra-pub-src-2/, and s3://sra-pub-run-odp/; see NIH NCBI Sequence Read Archive (SRA) on AWS) contain "hot" data that is free to download anonymously. Those buckets can be directly browsed anonymously using commands like
      aws s3 ls s3://sra-pub-src-1/ --no-sign-request
      
      and files can be downloaded directly via HTTP.
      • Example: Consider SRA Run DRR000110. The SRA Run Browser shows that the original FASTQ files are hosted in the S3 bucket sra-pub-src-1 and available for anonymous, free egress worldwide. One can list the raw files associated with that run via
        aws s3 ls s3://sra-pub-src-1/DRR000110/ --no-sign-request
        
        and download the files using a command like
        aws s3 cp s3://sra-pub-src-1/DRR000110 . --no-sign-request --recursive
        
        to copy the raw files 090324_30WB8AAXX_s_3_sequence.txt.tar.gz.1 and 090324_30WB8AAXX_s_4_sequence.txt.tar.gz.1 into the current working directory. Extracting those archives yields two FASTQ files s_3_sequence.txt and s_4_sequence.txt.
        • As shown in the SRA Run Browser, the raw TAR archives can also be directly downloaded from their S3 buckets via HTTP.
        • Instead of using the AWS CLI, the SRA Toolkit also supports downloading the raw data directly via the --type argument. See below.
    • All other buckets that are shown in the Run Browser for any SRA Run appear to either be region-specific in their free egress support or host data in cold storage.
  • Downloading data from cold storage: use the "Create a Data Delivery order" page to retrieve the data into a user's cloud storage bucket. This will incur cloud storage costs for the user. The data can then be retrieved from the user cloud storage bucket, potentially incurring additional costs.
    • Follow the instructions on the "Create a Data Delivery order" page to adjust bucket permissions. I successfully retrieved data using the following permissions settings with a new S3 bucket:
      • Do not block public access (uncheck all "block public access" or "block all public access" boxes)
      • Copy the automatically generated bucket policy (JSON text) from the "Create a Data Delivery order" page to the "Bucket policy" section of the Permissions tab of the bucket.
      • Upon retrieval (which may take up to 48 hours), a metadata CSV file is deposited into the target bucket along with a folder (with the SRA Run accession as its name) containing the requested data.
    • If logged into your MyNCBI account, the "Create a Data Delivery order" page will show the status of recent data delivery orders from the last 30 days.
  • Download data from hot storage: download using the SRA Toolkit or using cloud APIs
    • Example: SRA Run DRR310659, which is available in SRA Normalized Format on GCP at gs://sra-pub-run-110/DRR310659/DRR310659.1 with free egress to gs.us-east1. (It is also available with free egress worldwide from NCBI servers, but for the sake of this example, we restrict ourselves to downloading from GCP.)
      • To download using the SRA Toolkit specifically from the GCP bucket (as opposed to other servers):
        fasterq-dump --location gs://sra-pub-run-110/DRR310659/DRR310659.1 DRR310659
        
        • The --location argument is explained in the help of fasterq-dump version 3.0.0 but not the latest SRA Toolkit version 3.1.1.
      • To download using the gcloud CLI:
        gcloud storage --billing-project=<billing-project> cp gs://sra-pub-run-110/DRR310659/DRR310659.1 .
        
        where <billing-project> is a project ID shown under the "ID" column at https://console.cloud.google.com/billing/projects. This downloads a SRA Normalized Format file that can be converted to FASTQ and other formats via the SRA Toolkit programs, such as
        fasterq-dump ./DRR310659
        
      • Costs: Presumably if these commands are run within a Google Cloud virtual machine (Cloud Shell or Compute Engine instance) located in a us-east1 region, then the download is free. However, if downloading to local premises, then an egress cost may be incurred.
    • If originally submitted source files are available in hot storage, they can be downloaded directly using the SRA Toolkit by using the --type argument.
      • Example: The Run Browser for SRA Run DRR000110 lists the original format files as type fastq. We already previously explored how to download the raw files via HTTP or the AWS CLI. To download using the SRA Toolkit, run
        prefetch --type fastq DRR000110
        
        which will create a folder DRR000110 with the 2 raw data files inside: 090324_30WB8AAXX_s_3_sequence.txt.tar.gz and 090324_30WB8AAXX_s_4_sequence.txt.tar.gz.

SRA Toolkit

Documentation: https://github.com/ncbi/sra-tools/wiki

  • Note (as of 2024-08-07): Because there are a lot of Wiki pages, some of them are initially hidden. Click on "Show 11 more pages..." to see all of them.

Building from source: see ncbi/sra-tools#937 (comment)

  • Using CMake:

    # set where to install the sratoolkit
    DIR_INSTALL="$HOME/local/sratoolkit"
    # set where to download source code and create build directory
    DIR_TMP="$HOME/tmp/scratch/sratoolkit_build"
    
    cd "$DIR_TMP"
    git clone https://github.com/ncbi/ncbi-vdb.git
    git clone https://github.com/ncbi/sra-tools.git
    mkdir build
    cd build
    cmake -S "$(cd ../ncbi-vdb; pwd)" -B ncbi-vdb
    cmake --build ncbi-vdb
    cmake -D VDB_LIBDIR="${PWD}/ncbi-vdb/lib" -D CMAKE_INSTALL_PREFIX="$DIR_INSTALL" -S "$(cd ../sra-tools; pwd)" -B sra-tools 
    cmake --build sra-tools --target install
    
    # binaries are installed to "$DIR_INSTALL/bin/"
    
  • Using Autoconf

    DIR_WD="$(pwd -P)"
    mkdir sra_install
    mkdir sra_build
    mkdir sra_src
    cd sra_src
    git clone https://github.com/ncbi/ncbi-vdb.git
    git clone https://github.com/ncbi/sra-tools.git
    cd ncbi-vdb
    ./configure --build-prefix="$DIR_WD/sra_build" --prefix="$DIR_WD/sra_install"
    make
    make install
    cd ../sra-tools
    ./configure --build-prefix="$DIR_WD/sra_build" --prefix="$DIR_WD/sra_install"
    make
    make install
    
    # binaries are now available at "$DIR_WD/bin"
    

There are 2 basic ways to download data from the SRA with the SRA Toolkit:

  1. Prefetch and then extract to desired data type
  2. On demand
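
A minimal sketch of both approaches, using one of the runs from the example above (SRR14842522):

    # 1. Two-step: prefetch the SRA archive into ./SRR14842522/, then convert it to FASTQ
    prefetch SRR14842522
    fasterq-dump SRR14842522

    # 2. On demand: fasterq-dump retrieves and converts in a single step
    fasterq-dump SRR14842522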

While prefetch and fasterq-dump are the main programs, the SRA Toolkit comes with all of the following tools in the bin directory where it is installed. Some useful commands are indicated below.

  • abi-dump
  • align-info
  • cache-mgr
  • check-corrupt
  • fasterq-dump
  • fastq-dump
  • illumina-dump
  • kdbmeta
  • ngs-pileup
  • prefetch
    • Use the --max-size argument to download more than 20 GB of data.
  • rcexplain
  • ref-variation
  • sam-dump
  • sff-dump
  • sra-info
  • srapath
  • sra-pileup
  • sra-search
  • sra-stat
  • sratools
  • test-sra
  • var-expand
  • vdb-config
    • The full configuration of the toolkit can be viewed by running vdb-config. It appears that the interactive configuration settings from running vdb-config -i are saved to ~/.ncbi/user-settings.mkfg.
    • The interactive form of vdb-config does not expose all settings, some of which can only be set via the command line. See https://github.com/ncbi/sra-tools/wiki/06.-Connection-Timeouts.
  • vdb-decrypt
  • vdb-dump
    • vdb-dump --info <accession>: show the size (in bytes) of the accession, among other information
  • vdb-encrypt
  • vdb-validate