- How-Tos
  - Genome annotations
  - UCSC
  - NCI Genomic Data Commons
  - NCBI Entrez and E-Utilities
  - NCBI Genomics Databases
Conversion tables
- NCBI genome assembly reports: assembly name <> GenBank accession <> RefSeq accession <> UCSC
- "assembly name" is the my term for the
Sequence-Namecolumn in the assembly reports; these names appear in the FASTA comment of NCBI Nucleotide entries and are also used by Ensembl for alternate loci, novel patches, and fix patches (see below) - Example (human GRCh38): https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/001/405/GCA_000001405.29_GRCh38.p14/GCA_000001405.29_GRCh38.p14_assembly_report.txt
- Example (mouse GRCm39): https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/001/635/GCF_000001635.27_GRCm39/GCF_000001635.27_GRCm39_assembly_report.txt
- "assembly name" is the my term for the
- UCSC chromAlias: Ensembl <> RefSeq accession <> UCSC (and possibly more)
  - This is present in 2 formats:
    - `<genome path>/database/chromAlias.txt.gz`: a long-form table, with each row representing a single conversion between a UCSC-style name and another style of name
    - `<genome path>/bigZips/<genome>.chromAlias.txt`: a wide-form table, with each row representing a single chromosome and columns giving all the different names for that chromosome
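As a concrete example, a minimal bash sketch for inspecting the wide-form table (I am assuming the first column holds the UCSC-style name; the commented header varies by genome, so confirm against it first):

```bash
# Wide-form chromAlias table for hg38: the commented first line names the
# columns, which differ between genomes, so inspect it before parsing.
url=https://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/hg38.chromAlias.txt
curl -s "$url" | head -n 1                      # column names
curl -s "$url" | awk -F'\t' '$1 == "chr1"'      # all names for chromosome 1
```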
Sequence naming formats for mouse and human genomes
- NCBI GenBank: accessions
- Human: chromosomes 1-22, X, and Y = CM000[663-686].[version]
- Mouse: chromosomes 1-19, X, and Y = CM00[0994-1014].[version]
- Accessions for mitochondrial chromosomes and non-reference chromosomes/scaffolds (alternate loci, unlocalized scaffolds, unplaced scaffolds, fix patches, and novel patches) do not follow a linear ordering.
- NCBI RefSeq: accessions
- Human: chromosomes 1-22, X, and Y = NC_0000[01-24].[version]
- Mouse: chromosomes 1-19, X, and Y = NC_0000[67-87].[version]
- Accessions for mitochondrial chromosomes and non-reference chromosomes/scaffolds do not follow a linear ordering.
- Ensembl
- Reference chromosomes: 1-22 (human) or 1-19 (mouse) + X + Y + MT
- Alternate loci, novel patches, fix patches: assembly name
- Unlocalized scaffolds and unplaced scaffolds: GenBank accession
- UCSC
  - Reference chromosomes: `chr[#|X|Y|M]`
  - Unlocalized scaffolds, alternate loci scaffolds, fix loci scaffolds: `chr[#|X|Y|M]_[GenBank accession]v[GenBank version]_[random|alt|fix]`
  - Unplaced scaffolds: `chrUn_[GenBank accession]v[GenBank version]`
- (human and mouse only) GENCODE
  - Reference chromosomes: UCSC-style `chr[#|X|Y|M]`
  - Non-reference chromosomes/scaffolds: GenBank `accession.version`
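These naming schemes can be cross-referenced programmatically by parsing an NCBI assembly report. A bash sketch (the column positions are my reading of the report's commented header; verify against the header itself):

```bash
# NCBI assembly reports are tab-delimited with CRLF line endings; relevant
# columns: 1 = Sequence-Name ("assembly name"), 5 = GenBank-Accn,
# 7 = RefSeq-Accn, 10 = UCSC-style-name.
url=https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/001/405/GCA_000001405.29_GRCh38.p14/GCA_000001405.29_GRCh38.p14_assembly_report.txt
curl -s "$url" | tr -d '\r' |
    awk -F'\t' 'BEGIN {OFS="\t"} !/^#/ {print $1, $5, $7, $10}' |
    head -n 5
```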
GRCh38.p14 notes (see also Google Colab notebook)
- GenBank assembly accessions start with "GCA", while RefSeq assembly accessions start with "GCF".
- History: The GRCh38 initial release in 2013 (GCA_000001405.15, GCF_000001405.26) contained 455 sequences (25 reference chromosomes + 127 unplaced scaffolds + 42 unlocalized scaffolds + 261 alternate loci). By patch 14 (GCA_000001405.29, GCF_000001405.40), 4 GenBank accessions (described below) had been dropped from the RefSeq assembly, while 164 fix patches and 90 novel patches had been added, for a total of 709 GenBank accessions and 705 RefSeq accessions.
- `KI270752.1` (unplaced scaffold): dropped in patch 13 from the RefSeq assembly "because it is hamster sequence derived from the human-hamster CHO cell line" [UCSC hg38 bigZips]
  - This sequence is still kept in the NCBI GenBank assembly. "Removal of this sequence from the GenBank assembly can only be done at the time of a new major assembly release." [GRC Issue HG-2587]
- `KI270825.1` (alternate locus), `KI270721.1` (unlocalized scaffold), `KI270734.1` (unlocalized scaffold): "contamination or obsolete" sequences dropped in patch 14 from the RefSeq assembly [UCSC hg38 bigZips]
- Comparison of other assemblies with the NCBI GenBank assembly
- NCBI RefSeq: excludes the 4 sequences above
- Ensembl release 113: contains the 4 sequences above but excludes 3 fix patches: `MU273354.1`, `KN538374.1`, and `MU273386.1`
- UCSC: excludes the 4 sequences above, but contains 2 extra sequences, `KQ759759.1` and `KQ759762.1`
  - `KQ759759` and `KQ759762` (fix patches) were updated from version 1 to version 2 in patch 14
  - "Because of the difficulty of removing the old chroms chr11_KQ759759v1_fix and chr22_KQ759762v1_fix from all of the database tables and bigData files, custom tracks, and hubs, we are not dropping them from the UCSC hg38 patch 14 .2bit and chromInfo. However, we have dropped them from chromAlias to accord with the Genbank and Refseq official releases for patch14." [UCSC hg38 bigZips]
- GENCODE: many non-reference sequences have no annotations
References
- https://lh3.github.io/2017/11/13/which-human-reference-genome-to-use
- https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/001/405/GCF_000001405.40_GRCh38.p14/GRCh38_major_release_seqs_for_alignment_pipelines/README_analysis_sets.txt
- https://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/
General guidelines
- Unless the aligner is "ALT-aware" and can appropriately use alternate loci sequences, do not include the alternate loci sequences in alignment indices.
- Include the unplaced and unlocalized scaffolds
- This will prevent false alignment of reads from those genomic regions to the reference chromosomes.
- Hard-mask duplicate regions
- Example (human genome): the two PAR regions on chromosome Y, and duplicate copies of centromeric arrays and WGS on chromosomes 5, 14, 19, 21 & 22
- An Epstein-Barr virus (EBV) sequence is often included "as a sink for alignment of reads that are often present in sequencing samples."
FASTA sequences and indices following these guidelines are termed "analysis sets":
- GRCh38.p14: https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/001/405/GCF_000001405.40_GRCh38.p14/GRCh38_major_release_seqs_for_alignment_pipelines/
- GRCm39: https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/001/635/GCF_000001635.27_GRCm39/seqs_for_alignment_pipelines/
- GRCm38 (mm10): use the initial assembly release sequences, which contain no alternate loci [UCSC mm10 bigZips]
- T2T CHM13: https://s3-us-west-2.amazonaws.com/human-pangenomics/index.html?prefix=T2T/CHM13/assemblies/analysis_set/ (as pointed to in the T2T CHM13 GitHub README)
- The NCBI FTP folder for the T2T CHM13 genome does not contain an analysis set.
- Bowtie 2 (see the sidebar on the manual webpage) provides an index, but it is not masked. Consequently, would reads originating from repetitive/duplicate regions simply fail to align?
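For example, to fetch the widely used GRCh38 "no-alt" analysis set from the NCBI directory linked above (the file name is as listed in that FTP directory):

```bash
# GRCh38 analysis set without alternate loci; duplicate regions (e.g., the
# chrY PARs) are hard-masked and the EBV decoy sequence is included.
base=https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/001/405/GCF_000001405.40_GRCh38.p14/GRCh38_major_release_seqs_for_alignment_pipelines
curl -sO "$base/GCA_000001405.15_GRCh38_no_alt_analysis_set.fna.gz"
```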
References
- Papers
- Amemiya HM, Kundaje A, Boyle AP. The ENCODE Blacklist: Identification of Problematic Regions of the Genome. Sci Rep. 2019;9(1):9354. Published 2019 Jun 27. doi:10.1038/s41598-019-45839-z
- Ogata JD, Mu W, Davis ES, et al. excluderanges: exclusion sets for T2T-CHM13, GRCm39, and other genome assemblies. Bioinformatics. 2023;39(4):btad198. doi:10.1093/bioinformatics/btad198
- Anshul Kundaje's webpage: https://sites.google.com/site/anshulkundaje/projects/blacklists
- ENCODE Annotation File Set: https://www.encodeproject.org/annotations/ENCSR636HFF/
- Blacklist program written by Alan Boyle for the ENCODE project: https://github.com/Boyle-Lab/Blacklist
Below, I've compiled blacklists from ENCODE and associated labs (Anshul Kundaje and the Boyle lab). For a more comprehensive and updated table of blacklists, see the excluderanges package.
| Genome | Blacklist version | Links |
|---|---|---|
| hg19 | v1 | GitHub, ENCODE* |
| hg19 | v2 | GitHub |
| hg38 | v1 | Kundaje, GitHub |
| hg38 | v2 | GitHub |
| hg38 | v3 | ENCODE |
| mm10 | v1 | ENCODE, Kundaje, GitHub |
| mm10 | v2 | GitHub |
* Confusingly, Anshul Kundaje's webpage lists the hg19 annotation file ENCFF001TDO as both Version 1 and Version 3. The file is identical to Version 1 of the hg19 blacklist on the Boyle Lab GitHub.
As of commit 61a04d2.
Blacklists are computed for one reference chromosome at a time. Consider a chromosome of length $N$, divided into bins of 1000 bp (`binSize`). Each bin overlaps the next bin by 900 bp (`binSize` - `binOverlap`); i.e., a new bin starts every 100 bp. For example, bin 0 represents the interval [0, 1000), bin 1 represents [100, 1100), etc. The number of total bins is $B = \lfloor (N - \mathrm{binSize}) / \mathrm{binOverlap} \rfloor + 1$.
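A quick sanity check of the binning arithmetic in bash (the chromosome length is illustrative; this reflects the intended, bug-fixed bin count discussed below):

```bash
N=248956422    # length of GRCh38 chr1, as an example
binSize=1000
binOverlap=100
B=$(( (N - binSize) / binOverlap + 1 ))   # integer (floor) division
echo "B = $B bins"
echo "bin 0 = [0, $binSize); bin 1 = [$binOverlap, $(( binOverlap + binSize )))"
```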
Working definitions
- A position $i$ is considered "uniquely mappable" for read length $k$ if the sequence from position $i$ to $i + k$ does not appear elsewhere in the genome.
- A read (or sequence) is considered "uniquely mappable" if the position $i$ to which the read aligns is uniquely mappable for the read length of the read; otherwise it is considered "multimapping."
  - A "multimapping" read does not mean that the aligner found (nearly) equally good alignments for the read at multiple different genomic loci, but rather that the position that the read aligns to is not considered "uniquely mappable."
Parameters
- `binSize` = 1000
- `binOverlap` = 100
- `uniqueLength` = 36: the presumed read length
Input
- Mappability vector for the chromosome generated by Umap (typically run with k-mer sizes 24, 36, 50, and 100): $X \in \{0, \ldots, 255\}^N$
  - If $X_i = 0$, then position $i$ was not uniquely mappable for any of the k-mer lengths tested.
  - If $X_i > 0$, then $X_i$ is the length of the shortest sequence (out of the k-mer lengths tested) starting at position $i$ that is uniquely mappable.
- $L$ BAM files of input libraries from different experiments
  - The developer suggests using $L \geq 50$ input tracks. [GitHub issue 6]
  - Based on the lists of input tracks provided, these appear to be unfiltered alignment files and include reads that did not align.
Output: BED file containing regions marked as "Low Mappability" or "High Signal Region"
- Such regions contain at least one bin at or above the 99.9th percentile (for the corresponding region type, i.e., the count of uniquely mappable reads or of multimapping reads) together with contiguously neighboring bins at or above the 99th percentile (or bins with 0 signal); regions that are < 20 kbp apart are merged.
- "High Signal Region": contains at least 1 bin with an abnormally high count of uniquely mappable reads
- "Low Mappability": contains 1 or more bins with an abnormally high count of "multimapping" reads, and no bins with an abnormally high count of uniquely mappable reads
Key variables
- `binsMap` (type = integer, length = $B$): number of uniquely mappable positions within each bin
- `inputData` (type = `SequenceData`, length = $L$): stores the # of uniquely mappable or multimapping reads in each bin for each input library
  - `binsInput` (type = integer, length = $B$): # of reads in each bin that are uniquely mappable
    - There is a minor bug in the code such that the length of `binsInput` may be 1 less than $B$ when $N - \mathrm{binSize}$ is exactly divisible by `binOverlap`. Line 112 in blacklist.cpp, `for(int i = 0; i < tempCounts.size() - binSize; i+=binOverlap) {`, should be `for(int i = 0; i <= tempCounts.size() - binSize; i+=binOverlap) {`. This "bug" is accounted for in the `main()` function (lines 412-414) by dropping the last element/bin of `binsMap` if `binsMap` is longer than `binsInput`. This means that up to the last $(N \mathbin{\%} \mathrm{binOverlap}) + \mathrm{binOverlap}$ base pairs are left out of the blacklist analysis.
  - `binsMultimapping` (type = integer, length = $B$): # of reads in each bin that are not uniquely mappable
  - `totalReads` (type = integer): # of reads mapping to the chromosome being analyzed
- `readNormList` (type = double, length = $B$): median (across the $L$ input libraries) of the quantile-normalized number of uniquely mappable reads per uniquely mappable position for each bin
  $$\underset{\text{input libraries}}{\mathrm{median}} \left(\underset{\text{input libraries}}{\text{quantile-norm}} \left( U \right)\right)$$
  where $U_{b,l} = \frac{\mathrm{inputData}[l].\mathrm{binsInput}[b]}{\mathrm{binsMap}[b]}$ is the number of uniquely mappable reads in input library $l$ divided by the number of uniquely mappable positions in bin $b$.
  - Quantile normalization makes the distribution of read counts identical across all input libraries (see also Wikipedia)
    - Procedure implemented in the Blacklist program: sort `binsInput` for each input library; set the $x$th-ranked value in each input library to the mean of $x$th-ranked values across all input libraries.
- `multiList` (type = double, length = $B$): median (across the $L$ input libraries) of the quantile-normalized number of multimapping reads per million total reads for each bin
  $$\underset{\text{input libraries}}{\mathrm{median}} \left(\underset{\text{input libraries}}{\text{quantile-norm}} \left( M \right)\right)$$
  where $M_{b,l} = \frac{\mathrm{inputData}[l].\mathrm{binsMultimapping}[b]}{\mathrm{inputData}[l].\mathrm{totalReads}} \cdot 1000000$ is the number of multimapping reads in bin $b$ per million total reads in input library $l$.
  - Quantile normalization is performed analogously as with `readNormList` but using the multimapping read distribution (`binsMultimapping`) from each input library instead of the uniquely mappable read distribution (`binsInput`).
Human genome
- Single representative transcripts per gene: Ensembl Canonical and RefSeq Select are supersets of MANE Select
- MANE Select = set of Ensembl Canonical and RefSeq Select transcripts that are annotated identically in the RefSeq and the Ensembl-GENCODE gene sets and perfectly align to GRCh38
- (Ensembl release 112 or GENCODE release v46) All transcripts tagged with "MANE_Select" are also tagged with "Ensembl Canonical" (see the sketch after this list for one way to check this)
- (RefSeq release GCF_000001405.40-RS_2023_10) Transcripts tagged with "MANE Select" are not additionally tagged with "RefSeq Select"
- Versioning
- MANE --> Ensembl and NCBI releases: see README of MANE releases on NCBI's FTP server: https://ftp.ncbi.nlm.nih.gov/refseq/MANE/MANE_human/
- GENCODE <--> Ensembl version mapping can be found on GENCODE's website: https://www.gencodegenes.org/human/releases.html
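Tag relationships like the one noted above can be spot-checked by grepping the GENCODE GTF. A bash sketch (the URL follows GENCODE's release naming, and the tag spellings `MANE_Select` and `Ensembl_canonical` are my reading of the v46 GTF conventions):

```bash
# Extract transcript records from the GENCODE v46 GTF, then count tags.
url=https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_46/gencode.v46.annotation.gtf.gz
curl -s "$url" | zcat | awk -F'\t' '$3 == "transcript"' > transcripts.gtf
grep -c 'tag "MANE_Select"' transcripts.gtf
# Transcripts tagged MANE_Select but NOT Ensembl_canonical; expect 0 if
# every MANE Select transcript is also the Ensembl canonical transcript.
grep 'tag "MANE_Select"' transcripts.gtf | grep -vc 'tag "Ensembl_canonical"'
```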
Questions
- What is the precise relationship between RepeatMasker and RepBase?
References and resources
- De novo tutorials
- https://bioinformaticsworkbook.org/dataAnalysis/ComparativeGenomics/RepeatModeler_RepeatMasker.html
- Storer JM, Hubley R, Rosen J, Smit AFA. Curation Guidelines for de novo Generated Transposable Element Families. Current Protocols. 2021;1(6):e154. doi:10.1002/cpz1.154
- RepeatMasker tutorials
- Tarailo-Graovac M, Chen N. Using RepeatMasker to Identify Repetitive Elements in Genomic Sequences. Current Protocols in Bioinformatics. 2009;25(1):4.10.1-4.10.14. doi:10.1002/0471250953.bi0410s25
- Tempel S. Using and Understanding RepeatMasker. In: Bigot Y, ed. Mobile Genetic Elements: Protocols and Genomic Applications. Humana Press; 2012:29-51. doi:10.1007/978-1-61779-603-6_2
- https://training.galaxyproject.org/training-material/topics/genome-annotation/tutorials/repeatmasker/tutorial.html
- RepeatMasker manual online: https://www.animalgenome.org/bioinfo/resources/manuals/RepeatMasker.html
- Example publications using RepeatMasker
- See the supplement for the T2T paper on human repeat elements: Hoyt SJ, Storer JM, Hartley GA, et al. From telomere to telomere: The transcriptional and epigenetic state of human repeat elements. Science. 2022;376(6588):eabk3112. doi:10.1126/science.abk3112
- See File S8 from Mashanov V, Machado DJ, Reid R, Brouwer C, Kofsky J, Janies DA. Twinkle twinkle brittle star: the draft genome of Ophioderma brevispinum (Echinodermata: ophiuroidea) as a resource for regeneration research. BMC Genomics. 2022;23(1):574. doi:10.1186/s12864-022-08750-y
- RepeatModeler2 publication: Flynn JM, Hubley R, Goubert C, et al. RepeatModeler2 for automated genomic discovery of transposable element families. Proceedings of the National Academy of Sciences. 2020;117(17):9451-9457. doi:10.1073/pnas.1921046117
Download site: https://hgdownload.soe.ucsc.edu/downloads.html
UCSC Servers (US server URLs provided; see UCSC Genome Browser's documentation for URLs for servers in Europe and elsewhere)
- HTTP server: `hgdownload.soe.ucsc.edu/`
- FTP server: `ftp://hgdownload.soe.ucsc.edu`
- MariaDB (MySQL) server: `genome-mysql.soe.ucsc.edu`
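Annotation tables can be queried directly from the public MariaDB server (read-only access with user `genome` and no password, per UCSC's documentation):

```bash
# List the 5 largest sequences in hg38; -A skips the slow tab-completion
# scan the mysql client otherwise performs on these large databases.
mysql --user=genome --host=genome-mysql.soe.ucsc.edu -A -D hg38 \
    -e 'SELECT chrom, size FROM chromInfo ORDER BY size DESC LIMIT 5;'
```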
Organization of directories accessible on HTTP and FTP servers
- `goldenPath` (`https://hgdownload.soe.ucsc.edu/goldenPath/<genome>` or `ftp://hgdownload.soe.ucsc.edu/goldenPath/<genome>`): UCSC genome annotations [UCSC Genome Browser User Guide]
  - Files in this directory are largely descriptive (showing where things are along the genome) rather than numeric.
    - Exceptions: conservation scores (under the `phyloP[#]way` and `phastCons[#]way` directories)
  - bigZips/: genome sequence, selected annotation files, and updates
    - "Files in this directory reflect the initial... release of the genome, the most current versions are in the "latest/" subdirectory"
    - RepeatMasker-masked genome FASTA files
    - `<genome>.chrom.sizes`
    - `<genome>.chromAlias.txt`
    - (Updated regularly) RefSeq mRNA multi-FASTA files
    - (Updated regularly) upstream1000/2000/5000: sequences 1000/2000/5000 bases upstream of annotated TSSs of RefSeq genes with annotated 5' UTRs
  - chromosomes/: a FASTA file for each chromosome/scaffold from the initial genome assembly release (i.e., without any patches)
  - database/: annotation tables, where each table is represented by a `.sql` file containing the SQL commands used to create the table and a `.txt.gz` file of the table data in tab-delimited format. Schema descriptions can be found by selecting the relevant dataset and clicking the "Data format description" button in the Table Browser.
    - 1 table from the RepeatMasker track: `rmsk`
      - rmsk.txt.gz: appears to be the main annotation file to use (e.g., as used in the Skipper pipeline; also used by the UCSCRepeatMasker Bioconductor AnnotationHub, see source code `./inst/scripts/make-data_UCSCRepeatMasker.R`); source unclear (not clearly described in the GitHub repo). See the sketch after this list for downloading it.
    - 6 tables from the RepeatMasker Viz. track: `rmsk[Align|Out|Joined][Baseline|Current]`
      - rmskAlign* and rmskOut*: very similar, except that Align corresponds to the ".align" file generated by RepeatMasker, which shows the alignments between the repeat and query sequences and is missing from the ".out" file.
      - rmsk*Baseline vs. rmsk*Current: based on the GitHub repo, these appear to correspond to older vs. newer annotations
      - rmskJoined*: unclear
    - chromAlias.txt.gz
    - ... many others ...
- `gbdb` (`https://hgdownload.soe.ucsc.edu/gbdb/` or `ftp://hgdownload.soe.ucsc.edu/gbdb/`): bigBeds/bigWigs/BAMs and other binary files [UCSC Genome Browser Blog]
  - This includes essentially all functional genomics data, such as expression (e.g., RNA-seq data from the FANTOM and GTEx projects), transcription factor binding (e.g., ChIP-seq data from ENCODE and ReMap), and chromatin accessibility (e.g., DNase HS signal from ENCODE).
  - Integration of 3rd-party data: TCGA, variant effect prediction scores (e.g., from CADD), aberrant splicing scores (e.g., from AbSplice)
  - Other numeric tracks: GC content
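For instance, a sketch for grabbing the `rmsk` table and its schema from the `database/` directory described above:

```bash
# Each table ships as a .sql schema plus a .txt.gz data dump; the CREATE
# TABLE statement gives the column order for the tab-delimited data file.
base=https://hgdownload.soe.ucsc.edu/goldenPath/hg38/database
curl -s "$base/rmsk.sql" | grep -A 20 'CREATE TABLE'   # column definitions
curl -s "$base/rmsk.txt.gz" | zcat | head -n 3         # first data rows
```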
Provenance of files: https://github.com/ucscGenomeBrowser/kent/tree/master/src/hg/makeDb/doc
- Example: code for main Human GRCh38 annotations = https://github.com/ucscGenomeBrowser/kent/blob/master/src/hg/makeDb/doc/hg38/hg38.txt
Abbreviations
- NCI: National Cancer Institute
- GDC: Genomic Data Commons
Documentation: https://docs.gdc.cancer.gov/
Available fields: https://docs.gdc.cancer.gov/API/Users_Guide/Appendix_A_Available_Fields/
- Definitions of each field can be found by searching the last component of the field name (i.e., after the last period in the field name) in the GDC Data Dictionary.
Abbreviations
- TSS: Tissue Source Site
- BCR: Biospecimen Core Resource
Format: [project]-[TSS]-[participant]-[sample][vial]-[portion][analyte]-[plate]-[center]
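A small bash sketch of splitting a barcode into these components (the barcode value is illustrative):

```bash
# Fields are hyphen-delimited; sample+vial and portion+analyte are
# concatenated within their fields (e.g., "01C" = sample 01, vial C).
barcode="TCGA-02-0001-01C-01D-0182-01"   # illustrative barcode
IFS=- read -r project tss participant sample_vial portion_analyte plate center <<< "$barcode"
echo "project=$project TSS=$tss participant=$participant sample+vial=$sample_vial"
echo "portion+analyte=$portion_analyte plate=$plate center=$center"
```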
When retrieving data via the GDC API, the TCGA Barcode can be found (if applicable) under the cases.samples.submitter_id field, as shown below.
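For example, a request against the GDC files endpoint (parameter names follow the GDC API documentation):

```bash
# Ask the GDC API for a few files, including the nested TCGA barcode field.
curl -sG 'https://api.gdc.cancer.gov/files' \
    --data-urlencode 'fields=file_name,cases.samples.submitter_id' \
    --data-urlencode 'size=3' \
    --data-urlencode 'pretty=true'
```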
Reference
- GDC Documentation: https://docs.gdc.cancer.gov/Encyclopedia/pages/TCGA_Barcode/
- TCGA Code Tables: https://gdc.cancer.gov/resources-tcga-users/tcga-code-tables
- Wikipedia: https://en.wikipedia.org/wiki/The_Cancer_Genome_Atlas
E-Direct command line tutorials
- NCBI Workshop: Accessing NCBI Biology Resources Using EDirect for Command Line Novices
- Jupyter Notebook: https://github.com/esallychang/CommandLine_EDirect_July2023
- NCBI Workshop: Downloading NCBI Biological Data and Creating Custom Reports Using the Command Line
- Jupyter Notebook: https://github.com/esallychang/CommandLine_CustomData_April2023
Other references
- NCBI C++ Toolkit documentation: https://ncbi.github.io/cxx-toolkit/
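A minimal EDirect sketch connecting these tutorials to the accession mapping below (assumes EDirect is installed; `runinfo` is one of `efetch`'s documented formats for the SRA database):

```bash
# Find SRA runs under an ENCODE BioProject and print the first few run
# accessions (first column of the runinfo CSV; line 1 is a header).
esearch -db sra -query 'PRJNA63443[BioProject]' |
    efetch -format runinfo |
    cut -d, -f1 | head -n 5
```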
| NCBI Database | Study / Project | Biological Sample | Experiment / Technical Replicate | Run |
|---|---|---|---|---|
| BioProject / BioSample | BioProject (`PRJNA`) | BioSample (`SAMN`) | | |
| Sequence Read Archive | SRA Study (`SRP`) | SRA Sample (`SRS`) | SRA Experiment (`SRX`) | SRA Run (`SRR`) |
| Gene Expression Omnibus Database | GEO Series (`GSE`) | | GEO Sample (`GSM`) | |
- Columns: accession types in the same column usually map 1:1, except in the Study / Project column:
- "The BioProject database defines two types of projects: 1) primary submission projects... are directly associated with submitted data... 2) umbrella projects, which reflect a higher-level organizational structure for larger initiatives or provide an additional level of data tracking." [NCBI Handbook] Therefore, one BioProject can "contain" another BioProject.
- For example, ENCODE (which also has its own dedicated GEO listing page) has multiple layers of umbrella BioProjects: The human ENCODE (ENCyclopedia Of DNA Elements) project (PRJNA30707)
- Pilot projects for the human ENCODE project (PRJNA13681)
- 14 sub-projects ...
- Production projects for the human ENCODE project (PRJNA63441)
- Production ENCODE epigenomic data (PRJNA63443)
- Homo sapiens Epigenomics (PRJNA292727)
- Production ENCODE functional genomics data (PRJNA63447)
- Production ENCODE transcriptome data (PRJNA30709)
- Each BioProject and each SRA Study may be associated with multiple GEO Series.
- Clearly, an umbrella BioProject may be associated with multiple SRA Studies (i.e., 1:many mapping). However, can a primary submission BioProject be associated with multiple SRA Studies?
- A single SRA Study can be associated with many GEO Series. For example, the ENCODE SRA Study SRP012412 is associated with GEO Series GSE177866 and GSE231300, among many others.
- "The BioProject database defines two types of projects: 1) primary submission projects... are directly associated with submitted data... 2) umbrella projects, which reflect a higher-level organizational structure for larger initiatives or provide an additional level of data tracking." [NCBI Handbook] Therefore, one BioProject can "contain" another BioProject.
- Rows: The relationship is usually 1:many going across a row, left-to-right. However, many:many relationships also exist.
- References
- Relationship between SRA, BioProject, and BioSample: https://www.ncbi.nlm.nih.gov/sra/docs/submitmeta/, https://www.ncbi.nlm.nih.gov/sra/docs/submitquestions/
Consider the ENCODE Mint-ChIP-seq experiment ENCSR928PSU, which generated 2 technical replicates. Each technical replicate is associated with many read (FASTQ) files, not all of which have the same read length. Below, I show the accessions associated with technical replicate 2.
| NCBI Database | Study / Project | Biological Sample | Experiment / Technical Replicate | Run / Library |
|---|---|---|---|---|
| BioProject / BioSample | PRJNA63443 | SAMN19597277 | | |
| Sequence Read Archive | SRP012412 | SRS9223741 | SRX11165854 | 8 runs: SRR14842522, ..., SRR14842529 |
| Gene Expression Omnibus Database | GSE177866 | | GSM5379091 | |
| ENCODE | | ENCBS832HZD | ENCLB042NQH | (ENCFF089NQO (R1), ENCFF521RSO (R2)), ..., (ENCFF785QUX (R1), ENCFF204YZK (R2)) |
Notes
- Not all accessions have a useful dedicated page/interface and therefore are not hyperlinked above.
- Example: Searching for an SRA Sample accession simply returns a list of associated SRA Experiments (e.g., SRS9223741).
- Example: ENCODE pages for libraries from technical replicates (e.g., ENCLB042NQH) are merely JSON dumps; instead, technical replicates are best viewed from the experiment page (e.g., ENCSR928PSU).
- The same SRA sample accession SRS9223741 is associated with 2 SRA experiments: SRX11165854 (technical replicate 2: see GSM5379091) and SRX11165855 (technical replicate 1: see GSM5379092). This demonstrates how the term "sample" is used differently by SRA / BioSample (referring to a biological sample) vs. GEO (referring to a technical replicate or library).
- The concept of an ENCODE experiment accession (which may encompass multiple biological and technical replicates) does not appear to neatly correspond to any NCBI accession.
NCBI's All Resources page splits GEO into 3 components:
- GEO Database: main data repository, with GEO Samples organized into GEO Series.
- GEO DataSets: "a curated collection of biologically and statistically comparable GEO Samples" [GEO Overview]; accession prefix =
GDS - GEO Profiles: derived from GEO DataSets
Each of the 3 components has its own interface, but the GEO Database interface appears more limited and, as of 2025-12-03, is not selectable from the database/resource dropdown next to the search bar at the top of most NCBI pages. Instead, both Series and DataSets are searchable using the GEO DataSets interface.
Accession prefixes (see https://www.ncbi.nlm.nih.gov/sra/docs/submitmeta/)
- STUDY: `SRP#`
- SAMPLE (`SRS#`): can be shared between STUDYs and between EXPERIMENTs.
- EXPERIMENT (`SRX#`): main publishable unit in the SRA database
  - Each EXPERIMENT represents a combination of biological replicate, library, sequencing strategy (e.g., targeted selection vs. unbiased), layout (e.g., paired end vs. single end), and instrument model.
- RUN (`SRR#`): a "RUN is simply a manifest of data file(s) that are derived from sequencing a library described by the associated EXPERIMENT."
  - "All data files listed in a RUN will be merged into a single *.sra* archive file."
- SUBMISSION (`SRA#`; non-public accession)
SRA data formats
- References
- Overview: SRA Documentation
- Storage: SRA Archive Documentation, SRA Data Working Group 2021 report, and NLM Support (via email on 2024-08-08)
- SRA Normalized Format (`*.sra`; aka extract-transform-load or ETL format): contains base calls, full base quality scores, and alignments
  - Discards original read names [SRA Toolkit GitHub]
  - Storage: AWS (hot) via AWS Open Data Program --> free egress worldwide with anonymous identity
- SRA Lite (`*.sralite`; aka ETL-BQS for ETL format without base quality scores): contains base calls, a per-read quality flag, and alignments
  - Discards original read names [SRA Toolkit GitHub]
  - The per-read quality flag (`Read_Filter`) is either `pass` or `reject`. See SRA Documentation for how the SRA determines whether a read passes the read filter.
  - Storage
    - NCBI servers: free egress worldwide with anonymous identity
    - Cloud (AWS and GCP): hot; free egress to cloud services in the same geographical region
- Originally submitted source files
  - Storage: AWS, mostly cold storage
Accessing SRA data
- Web interface
  - Run Browser: Search by SRA Run Accession (`SRR#`) to see metadata, taxonomy analysis, read sequences, and data access information about the run, as well as a tool to download FASTA/FASTQ files for runs in the same SRA Experiment.
    - The Data access tab indicates where the data is stored (NCBI, AWS, or GCP servers) and what types of egress are free.
    - The FASTA/FASTQ download web interface only allows a limited download of < 5 Gb of sequence over HTTP. [SRA Documentation]
  - Run Selector: Search by SRA, BioProject, BioSample, or GEO accessions to see all associated SRA Runs. Offers an interface to download metadata or retrieve the data from cold cloud storage to a cloud bucket.
- AWS
  - Buckets available through the Registry of Open Data (`s3://sra-pub-src-1/`, `s3://sra-pub-src-2/`, and `s3://sra-pub-run-odp/`; see NIH NCBI Sequence Read Archive (SRA) on AWS) contain "hot" data that is free to download anonymously. Those buckets can be browsed anonymously using commands like
    ```bash
    aws s3 ls s3://sra-pub-src-1/ --no-sign-request
    ```
    and files can be downloaded directly via HTTP.
    - Example: Consider SRA Run DRR000110. The SRA Run Browser shows that the original FASTQ files are hosted in the S3 bucket `sra-pub-src-1` and available for anonymous, free egress worldwide. One can list the raw files associated with that run via
      ```bash
      aws s3 ls s3://sra-pub-src-1/DRR000110/ --no-sign-request
      ```
      and download them via
      ```bash
      aws s3 cp s3://sra-pub-src-1/DRR000110 . --no-sign-request --recursive
      ```
      which copies the raw files `090324_30WB8AAXX_s_3_sequence.txt.tar.gz.1` and `090324_30WB8AAXX_s_4_sequence.txt.tar.gz.1` into the current working directory. Extracting those archives yields two FASTQ files, `s_3_sequence.txt` and `s_4_sequence.txt`.
      - As shown in the SRA Run Browser, the raw TAR archives can also be directly downloaded from their S3 buckets via HTTP.
      - Instead of using the AWS CLI, the SRA Toolkit also supports downloading the raw data directly via the `--type` argument. See below.
  - All other buckets that are shown in the Run Browser for any SRA Run appear to either be region-specific in their free egress support or host data in cold storage.
- Downloading data from cold storage: use the "Create a Data Delivery order" page to retrieve the data into a user's cloud storage bucket. This will incur cloud storage costs for the user. The data can then be retrieved from the user's cloud storage bucket, potentially incurring additional costs.
  - Follow the instructions on the "Create a Data Delivery order" page to adjust bucket permissions. I successfully retrieved data using the following permissions settings with a new S3 bucket:
    - Do not block public access (uncheck all "block public access" or "block all public access" boxes)
    - Copy the automatically generated bucket policy (JSON text) from the "Create a Data Delivery order" page to the "Bucket policy" section of the Permissions tab of the bucket.
  - Upon retrieval (which may take up to 48 hours), a metadata CSV file is deposited into the target bucket along with a folder (named with the SRA Run accession) containing the requested data.
  - If logged into your MyNCBI account, the "Create a Data Delivery order" page will show the status of data delivery orders from the last 30 days.
- Download data from hot storage: download using the SRA Toolkit or using cloud APIs
  - Example: SRA Run DRR310659, which is available in SRA Normalized Format on GCP at `gs://sra-pub-run-110/DRR310659/DRR310659.1` with free egress to `gs.us-east1`. (It is also available with free egress worldwide from NCBI servers, but for the sake of this example, we restrict ourselves to downloading from GCP.)
    - To download using the SRA Toolkit specifically from the GCP bucket (as opposed to other servers):
      ```bash
      fasterq-dump --location gs://sra-pub-run-110/DRR310659/DRR310659.1 DRR310659
      ```
      - The `--location` argument is explained in the help of `fasterq-dump` version 3.0.0 but not the latest SRA Toolkit version 3.1.1.
    - To download using the `gcloud` CLI:
      ```bash
      gcloud storage --billing-project=<billing-project> cp gs://sra-pub-run-110/DRR310659/DRR310659.1 .
      ```
      where `<billing-project>` is a project ID shown under the "ID" column at https://console.cloud.google.com/billing/projects. This downloads an SRA Normalized Format file that can be converted to FASTQ and other formats via the SRA Toolkit programs, such as
      ```bash
      fasterq-dump ./DRR310659
      ```
    - Costs: Presumably, if these commands are run within a Google Cloud virtual machine (Cloud Shell or Compute Engine instance) located in a `us-east1` region, then the download is free. However, if downloading to local premises, an egress cost may be incurred.
  - If originally submitted source files are available in hot storage, they can be downloaded directly using the SRA Toolkit via the `--type` argument.
    - Example: The Run Browser for SRA Run DRR000110 lists the original format files as type `fastq`. We already explored above how to download the raw files via HTTP or the AWS CLI. To download using the SRA Toolkit, run
      ```bash
      prefetch --type fastq DRR000110
      ```
      which will create a folder DRR000110 with the 2 raw data files inside: `090324_30WB8AAXX_s_3_sequence.txt.tar.gz` and `090324_30WB8AAXX_s_4_sequence.txt.tar.gz`.
Documentation: https://github.com/ncbi/sra-tools/wiki
- Note (as of 2024-08-07): Because there are a lot of Wiki pages, some of them are initially hidden. Click on "Show 11 more pages..." to see all of them.
Building from source: see ncbi/sra-tools#937 (comment)
- Using CMake:

  ```bash
  # set where to install the sratoolkit
  DIR_INSTALL="$HOME/local/sratoolkit"
  # set where to download source code and create build directory
  DIR_TMP="$HOME/tmp/scratch/sratoolkit_build"
  cd "$DIR_TMP"
  git clone https://github.com/ncbi/ncbi-vdb.git
  git clone https://github.com/ncbi/sra-tools.git
  mkdir build
  cd build
  cmake -S "$(cd ../ncbi-vdb; pwd)" -B ncbi-vdb
  cmake --build ncbi-vdb
  cmake -D VDB_LIBDIR="${PWD}/ncbi-vdb/lib" -D CMAKE_INSTALL_PREFIX="$DIR_INSTALL" -S "$(cd ../sra-tools; pwd)" -B sra-tools
  cmake --build sra-tools --target install
  # binaries are installed to "$DIR_INSTALL/bin/"
  ```
Using Autoconf
DIR_WD="$(pwd -P)" mkdir sra_install mkdir sra_build mkdir sra_src cd sra_src git clone https://github.com/ncbi/ncbi-vdb.git git clone https://github.com/ncbi/sra-tools.git cd ncbi-vdb ./configure --build-prefix="$DIR_WD/sra_build" --prefix="$DIR_WD/sra_install" make make install cd ../sra-tools ./configure --build-prefix="$DIR_WD/sra_build" --prefix="$DIR_WD/sra_install" make make install # binaries are now available at "$DIR_WD/bin"
There are 2 basic ways to download data from the SRA with the SRA Toolkit:
- Prefetch and then extract to desired data type
- Tutorial: https://github.com/ncbi/sra-tools/wiki/08.-prefetch-and-fasterq-dump
  - For in-depth documentation for the `fasterq-dump` tool, see the page "HowTo: fasterq dump".
- On demand
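A sketch of both patterns for a public run (the accession is taken from the ENCODE example earlier; output paths depend on your vdb-config settings):

```bash
# Pattern 1: prefetch the .sra archive, then extract FASTQ from the local
# copy; prefetch can be re-run to resume an interrupted download.
prefetch SRR14842522                 # add --max-size for runs > 20 GB
fasterq-dump SRR14842522             # finds the prefetched ./SRR14842522/

# Pattern 2 ("on demand"): fasterq-dump downloads and extracts in one step.
fasterq-dump SRR14842529
```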
While prefetch and fasterq-dump are the main programs, the SRA Toolkit comes with all of the following tools in the bin directory where it is installed. Some useful commands are indicated below.
- `abi-dump`
- `align-info`
- `cache-mgr`
- `check-corrupt`
- `fasterq-dump`
- `fastq-dump`
- `illumina-dump`
- `kdbmeta`
- `ngs-pileup`
- `prefetch`
  - Use the `--max-size` argument to download more than 20 GB of data.
- `rcexplain`
- `ref-variation`
- `sam-dump`
- `sff-dump`
- `sra-info`
- `srapath`
- `sra-pileup`
- `sra-search`
- `sra-stat`
- `sratools`
- `test-sra`
- `var-expand`
- `vdb-config`
  - The full configuration of the toolkit can be viewed by running `vdb-config`. It appears that the interactive configuration settings from running `vdb-config -i` are saved to `~/.ncbi/user-settings.mkfg`.
  - The interactive form of `vdb-config` does not expose all settings, some of which can only be set via the command line. See https://github.com/ncbi/sra-tools/wiki/06.-Connection-Timeouts.
- `vdb-decrypt`
- `vdb-dump`
  - `vdb-dump --info <accession>`: show the size (in bytes) of the accession, among other information
- `vdb-encrypt`
- `vdb-validate`