- How-Tos
  - Genome annotations
  - UCSC
  - NCI Genomic Data Commons
  - NCBI Entrez and E-Utilities
  - NCBI Genomics Databases
Conversion tables
- NCBI genome assembly reports: assembly name <> GenBank accession <> RefSeq accession <> UCSC
- "assembly name" is the my term for the
Sequence-Namecolumn in the assembly reports; these names appear in the FASTA comment of NCBI Nucleotide entries and are also used by Ensembl for alternate loci, novel patches, and fix patches (see below) - Example (human GRCh38): https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/001/405/GCA_000001405.29_GRCh38.p14/GCA_000001405.29_GRCh38.p14_assembly_report.txt
- Example (mouse GRCm39): https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/001/635/GCF_000001635.27_GRCm39/GCF_000001635.27_GRCm39_assembly_report.txt
- "assembly name" is the my term for the
- UCSC chromAlias: Ensembl <> RefSeq accession <> UCSC (and possibly more)
  - This is present in 2 formats:
    - `<genome path>/database/chromAlias.txt.gz`: a long-form table, with each row representing a single conversion between a UCSC-style name and another style of name
    - `<genome path>/bigZips/<genome>.chromAlias.txt`: a wide-form table, with each row representing a single chromosome and columns giving all the different names for that chromosome
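As a concrete example, a minimal bash sketch for inspecting the wide-form table (I am assuming the first column holds the UCSC-style name; the commented header varies by genome, so confirm against it first):

```bash
# Wide-form chromAlias table for hg38: the commented first line names the
# columns, which differ between genomes, so inspect it before parsing.
url=https://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/hg38.chromAlias.txt
curl -s "$url" | head -n 1                      # column names
curl -s "$url" | awk -F'\t' '$1 == "chr1"'      # all names for chromosome 1
```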
Sequence naming formats for mouse and human genomes
- NCBI GenBank: accessions
- Human: chromosomes 1-22, X, and Y = CM000[663-686].[version]
- Mouse: chromosomes 1-19, X, and Y = CM00[0994-1014].[version]
- Accessions for mitochondrial chromosomes and non-reference chromosomes/scaffolds (alternate loci, unlocalized scaffolds, unplaced scaffolds, fix patches, and novel patches) do not follow a linear ordering.
- NCBI RefSeq: accessions
- Human: chromosomes 1-22, X, and Y = NC_0000[01-24].[version]
- Mouse: chromosomes 1-19, X, and Y = NC_0000[67-87].[version]
- Accessions for mitochondrial chromosomes and non-reference chromosomes/scaffolds do not follow a linear ordering.
- Ensembl
- Reference chromosomes: 1-22 (human) or 1-19 (mouse) + X + Y + MT
- Alternate loci, novel patches, fix patches: assembly name
- Unlocalized scaffolds and unplaced scaffolds: GenBank accession
- UCSC
  - Reference chromosomes: `chr[#|X|Y|M]`
  - Unlocalized scaffolds, alternate loci scaffolds, fix loci scaffolds: `chr[#|X|Y|M]_[GenBank accession]v[GenBank version]_[random|alt|fix]`
  - Unplaced scaffolds: `chrUn_[GenBank accession]v[GenBank version]`
- (human and mouse only) GENCODE
  - Reference chromosomes: UCSC-style `chr[#|X|Y|M]`
  - Non-reference chromosomes/scaffolds: GenBank `accession.version`
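These naming schemes can be cross-referenced programmatically by parsing an NCBI assembly report. A bash sketch (the column positions are my reading of the report's commented header; verify against the header itself):

```bash
# NCBI assembly reports are tab-delimited with CRLF line endings; relevant
# columns: 1 = Sequence-Name ("assembly name"), 5 = GenBank-Accn,
# 7 = RefSeq-Accn, 10 = UCSC-style-name.
url=https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/001/405/GCA_000001405.29_GRCh38.p14/GCA_000001405.29_GRCh38.p14_assembly_report.txt
curl -s "$url" | tr -d '\r' |
    awk -F'\t' 'BEGIN {OFS="\t"} !/^#/ {print $1, $5, $7, $10}' |
    head -n 5
```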
GRCh38.p14 notes (see also Google Colab notebook)
- GenBank assembly accessions start with "GCA", while RefSeq assembly accessions start with "GCF".
- History: The GRCh38 initial release in 2013 (GCA_000001405.15, GCF_000001405.26) contained 455 sequences (25 reference chromosomes + 127 unplaced scaffolds + 42 unlocalized scaffolds + 261 alternate loci). By patch 14 (GCA_000001405.29, GCF_000001405.40), 4 GenBank accessions (described below) had been dropped from the RefSeq assembly, while 164 fix patches and 90 novel patches had been added, for a total of 709 GenBank accessions and 705 RefSeq accessions.
- `KI270752.1` (unplaced scaffold): dropped in patch 13 from the RefSeq assembly "because it is hamster sequence derived from the human-hamster CHO cell line" [UCSC hg38 bigZips]
  - This sequence is still kept in the NCBI GenBank assembly. "Removal of this sequence from the GenBank assembly can only be done at the time of a new major assembly release." [GRC Issue HG-2587]
- `KI270825.1` (alternate locus), `KI270721.1` (unlocalized scaffold), `KI270734.1` (unlocalized scaffold): "contamination or obsolete" sequences dropped in patch 14 from the RefSeq assembly [UCSC hg38 bigZips]
- Comparison of other assemblies with the NCBI GenBank assembly
- NCBI RefSeq: excludes the 4 sequences above
- Ensembl release 113: contains the 4 sequences above but excludes 3 fix patches: `MU273354.1`, `KN538374.1`, and `MU273386.1`
- UCSC: excludes the 4 sequences above, but contains 2 extra sequences, `KQ759759.1` and `KQ759762.1`
  - `KQ759759` and `KQ759762` (fix patches) were updated from version 1 to version 2 in patch 14
  - "Because of the difficulty of removing the old chroms chr11_KQ759759v1_fix and chr22_KQ759762v1_fix from all of the database tables and bigData files, custom tracks, and hubs, we are not dropping them from the UCSC hg38 patch 14 .2bit and chromInfo. However, we have dropped them from chromAlias to accord with the Genbank and Refseq official releases for patch14." [UCSC hg38 bigZips]
- GENCODE: many non-reference sequences have no annotations
References
- https://lh3.github.io/2017/11/13/which-human-reference-genome-to-use
- https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/001/405/GCF_000001405.40_GRCh38.p14/GRCh38_major_release_seqs_for_alignment_pipelines/README_analysis_sets.txt
- https://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/
General guidelines
- Unless the aligner is "ALT-aware" and can appropriately use alternate loci sequences, do not include the alternate loci sequences in alignment indices.
- Include the unplaced and unlocalized scaffolds
- This will prevent false alignment of reads from those genomic regions to the reference chromosomes.
- Hard-mask duplicate regions
- Example (human genome): the two PAR regions on chromosome Y, and duplicate copies of centromeric arrays and WGS on chromosomes 5, 14, 19, 21 & 22
- An Epstein-Barr virus (EBV) sequence is often included "as a sink for alignment of reads that are often present in sequencing samples."
FASTA sequences and indices following these guidelines are termed "analysis sets":
- GRCh38.p14: https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/001/405/GCF_000001405.40_GRCh38.p14/GRCh38_major_release_seqs_for_alignment_pipelines/
- GRCm39: https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/001/635/GCF_000001635.27_GRCm39/seqs_for_alignment_pipelines/
- GRCm38 (mm10): use the initial assembly release sequences, which contain no alternate loci [UCSC mm10 bigZips]
- T2T CHM13: https://s3-us-west-2.amazonaws.com/human-pangenomics/index.html?prefix=T2T/CHM13/assemblies/analysis_set/ (as pointed to in the T2T CHM13 GitHub README)
- The NCBI FTP folder for the T2T CHM13 genome does not contain an analysis set.
- Bowtie 2 (see the sidebar on the manual webpage) provides an index, but it is not masked. Consequently, would reads originating from repetitive/duplicate regions simply fail to align?
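For example, to fetch the widely used GRCh38 "no-alt" analysis set from the NCBI directory linked above (the file name is as listed in that FTP directory):

```bash
# GRCh38 analysis set without alternate loci; duplicate regions (e.g., the
# chrY PARs) are hard-masked and the EBV decoy sequence is included.
base=https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/001/405/GCF_000001405.40_GRCh38.p14/GRCh38_major_release_seqs_for_alignment_pipelines
curl -sO "$base/GCA_000001405.15_GRCh38_no_alt_analysis_set.fna.gz"
```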
References
- Papers
- Amemiya HM, Kundaje A, Boyle AP. The ENCODE Blacklist: Identification of Problematic Regions of the Genome. Sci Rep. 2019;9(1):9354. Published 2019 Jun 27. doi:10.1038/s41598-019-45839-z
- Ogata JD, Mu W, Davis ES, et al. excluderanges: exclusion sets for T2T-CHM13, GRCm39, and other genome assemblies. Bioinformatics. 2023;39(4):btad198. doi:10.1093/bioinformatics/btad198
- Anshul Kundaje's webpage: https://sites.google.com/site/anshulkundaje/projects/blacklists
- ENCODE Annotation File Set: https://www.encodeproject.org/annotations/ENCSR636HFF/
- Blacklist program written by Alan Boyle for the ENCODE project: https://github.com/Boyle-Lab/Blacklist
Below, I've compiled blacklists from ENCODE and associated labs (Anshul Kundaje and the Boyle lab). For a more comprehensive and updated table of blacklists, see the excluderanges package.
| Genome | Blacklist version | Links |
|---|---|---|
| hg19 | v1 | GitHub, ENCODE* |
| hg19 | v2 | GitHub |
| hg38 | v1 | Kundaje, GitHub |
| hg38 | v2 | GitHub |
| hg38 | v3 | ENCODE |
| mm10 | v1 | ENCODE, Kundaje, GitHub |
| mm10 | v2 | GitHub |
* Confusingly, Anshul Kundaje's webpage lists the hg19 annotation file ENCFF001TDO as both Version 1 and Version 3. The file is identical to Version 1 of the hg19 blacklist on the Boyle Lab GitHub.
As of commit 61a04d2.
Blacklists are computed for one reference chromosome at a time. Consider a chromosome of length $N$, divided into bins of 1000 bp (`binSize`). Each bin overlaps the next bin by 900 bp (`binSize` - `binOverlap`); i.e., a new bin starts every 100 bp. For example, bin 0 represents the interval [0, 1000), bin 1 represents [100, 1100), etc. The number of total bins is $B = \lfloor (N - \mathrm{binSize}) / \mathrm{binOverlap} \rfloor + 1$.
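A quick sanity check of the binning arithmetic in bash (the chromosome length is illustrative; this reflects the intended, bug-fixed bin count discussed below):

```bash
N=248956422    # length of GRCh38 chr1, as an example
binSize=1000
binOverlap=100
B=$(( (N - binSize) / binOverlap + 1 ))   # integer (floor) division
echo "B = $B bins"
echo "bin 0 = [0, $binSize); bin 1 = [$binOverlap, $(( binOverlap + binSize )))"
```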
Working definitions
- A position $i$ is considered "uniquely mappable" for read length $k$ if the sequence from position $i$ to $i + k$ does not appear elsewhere in the genome.
- A read (or sequence) is considered "uniquely mappable" if the position $i$ to which the read aligns is uniquely mappable for the read length of the read; otherwise it is considered "multimapping."
  - A "multimapping" read does not mean that the aligner found (nearly) equally good alignments for the read at multiple different genomic loci, but rather that the position that the read aligns to is not considered "uniquely mappable."
Parameters
- `binSize` = 1000
- `binOverlap` = 100
- `uniqueLength` = 36: the presumed read length
Input
- Mappability vector for the chromosome generated by Umap (typically run with k-mer sizes 24, 36, 50, and 100): $X \in \{0, \ldots, 255\}^N$
  - If $X_i = 0$, then position $i$ was not uniquely mappable for any of the k-mer lengths tested.
  - If $X_i > 0$, then $X_i$ is the length of the shortest sequence (out of the k-mer lengths tested) starting at position $i$ that is uniquely mappable.
- $L$ BAM files of input libraries from different experiments
  - The developer suggests using $L \geq 50$ input tracks. [GitHub issue 6]
  - Based on the lists of input tracks provided, these appear to be unfiltered alignment files and include reads that did not align.
Output: BED file containing regions marked as "Low Mappability" or "High Signal Region"
- Such regions contain at least one bin at or above the 99.9th percentile (for the corresponding region type, i.e., the count of uniquely mappable reads or of multimapping reads) together with contiguously neighboring bins at or above the 99th percentile (or bins with 0 signal); regions that are < 20 kbp apart are merged.
- "High Signal Region": contains at least 1 bin with an abnormally high count of uniquely mappable reads
- "Low Mappability": contains 1 or more bins with an abnormally high count of "multimapping" reads, and no bins with an abnormally high count of uniquely mappable reads
Key variables
- `binsMap` (type = integer, length = $B$): number of uniquely mappable positions within each bin
- `inputData` (type = `SequenceData`, length = $L$): stores the # of uniquely mappable or multimapping reads in each bin for each input library
  - `binsInput` (type = integer, length = $B$): # of reads in each bin that are uniquely mappable
    - There is a minor bug in the code such that the length of `binsInput` may be 1 less than $B$ when $N - \mathrm{binSize}$ is exactly divisible by `binOverlap`. Line 112 in blacklist.cpp, `for(int i = 0; i < tempCounts.size() - binSize; i+=binOverlap) {`, should be `for(int i = 0; i <= tempCounts.size() - binSize; i+=binOverlap) {`. This "bug" is accounted for in the `main()` function (lines 412-414) by dropping the last element/bin of `binsMap` if `binsMap` is longer than `binsInput`. This means that up to the last $(N \mathbin{\%} \mathrm{binOverlap}) + \mathrm{binOverlap}$ base pairs are left out of the blacklist analysis.
  - `binsMultimapping` (type = integer, length = $B$): # of reads in each bin that are not uniquely mappable
  - `totalReads` (type = integer): # of reads mapping to the chromosome being analyzed
- `readNormList` (type = double, length = $B$): median (across the $L$ input libraries) of the quantile-normalized number of uniquely mappable reads per uniquely mappable position for each bin
  $$\underset{\text{input libraries}}{\mathrm{median}} \left(\underset{\text{input libraries}}{\text{quantile-norm}} \left( U \right)\right)$$
  where $U_{b,l} = \frac{\mathrm{inputData}[l].\mathrm{binsInput}[b]}{\mathrm{binsMap}[b]}$ is the number of uniquely mappable reads in input library $l$ divided by the number of uniquely mappable positions in bin $b$.
  - Quantile normalization makes the distribution of read counts identical across all input libraries (see also Wikipedia)
    - Procedure implemented in the Blacklist program: sort `binsInput` for each input library; set the $x$th-ranked value in each input library to the mean of $x$th-ranked values across all input libraries.
- `multiList` (type = double, length = $B$): median (across the $L$ input libraries) of the quantile-normalized number of multimapping reads per million total reads for each bin
  $$\underset{\text{input libraries}}{\mathrm{median}} \left(\underset{\text{input libraries}}{\text{quantile-norm}} \left( M \right)\right)$$
  where $M_{b,l} = \frac{\mathrm{inputData}[l].\mathrm{binsMultimapping}[b]}{\mathrm{inputData}[l].\mathrm{totalReads}} \cdot 1000000$ is the number of multimapping reads in bin $b$ per million total reads in input library $l$.
  - Quantile normalization is performed analogously as with `readNormList` but using the multimapping read distribution (`binsMultimapping`) from each input library instead of the uniquely mappable read distribution (`binsInput`).
Human genome
- Single representative transcripts per gene: Ensembl Canonical and RefSeq Select are supersets of MANE Select
- MANE Select = set of Ensembl Canonical and RefSeq Select transcripts that are annotated identically in the RefSeq and the Ensembl-GENCODE gene sets and perfectly align to GRCh38
- (Ensembl release 112 or GENCODE release v46) All transcripts tagged with "MANE_Select" are also tagged with "Ensembl Canonical" (see the sketch after this list for one way to check this)
- (RefSeq release GCF_000001405.40-RS_2023_10) Transcripts tagged with "MANE Select" are not additionally tagged with "RefSeq Select"
- Versioning
- MANE --> Ensembl and NCBI releases: see README of MANE releases on NCBI's FTP server: https://ftp.ncbi.nlm.nih.gov/refseq/MANE/MANE_human/
- GENCODE <--> Ensembl version mapping can be found on GENCODE's website: https://www.gencodegenes.org/human/releases.html
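Tag relationships like the one noted above can be spot-checked by grepping the GENCODE GTF. A bash sketch (the URL follows GENCODE's release naming, and the tag spellings `MANE_Select` and `Ensembl_canonical` are my reading of the v46 GTF conventions):

```bash
# Extract transcript records from the GENCODE v46 GTF, then count tags.
url=https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_46/gencode.v46.annotation.gtf.gz
curl -s "$url" | zcat | awk -F'\t' '$3 == "transcript"' > transcripts.gtf
grep -c 'tag "MANE_Select"' transcripts.gtf
# Transcripts tagged MANE_Select but NOT Ensembl_canonical; expect 0 if
# every MANE Select transcript is also the Ensembl canonical transcript.
grep 'tag "MANE_Select"' transcripts.gtf | grep -vc 'tag "Ensembl_canonical"'
```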
Questions
- What is the precise relationship between RepeatMasker and RepBase?
References and resources
- De novo tutorials
- https://bioinformaticsworkbook.org/dataAnalysis/ComparativeGenomics/RepeatModeler_RepeatMasker.html
- Storer JM, Hubley R, Rosen J, Smit AFA. Curation Guidelines for de novo Generated Transposable Element Families. Current Protocols. 2021;1(6):e154. doi:10.1002/cpz1.154
- RepeatMasker tutorials
- Tarailo-Graovac M, Chen N. Using RepeatMasker to Identify Repetitive Elements in Genomic Sequences. Current Protocols in Bioinformatics. 2009;25(1):4.10.1-4.10.14. doi:10.1002/0471250953.bi0410s25
- Tempel S. Using and Understanding RepeatMasker. In: Bigot Y, ed. Mobile Genetic Elements: Protocols and Genomic Applications. Humana Press; 2012:29-51. doi:10.1007/978-1-61779-603-6_2
- https://training.galaxyproject.org/training-material/topics/genome-annotation/tutorials/repeatmasker/tutorial.html
- RepeatMasker manual online: https://www.animalgenome.org/bioinfo/resources/manuals/RepeatMasker.html
- Example publications using RepeatMasker
- See the supplement for the T2T paper on human repeat elements: Hoyt SJ, Storer JM, Hartley GA, et al. From telomere to telomere: The transcriptional and epigenetic state of human repeat elements. Science. 2022;376(6588):eabk3112. doi:10.1126/science.abk3112
- See File S8 from Mashanov V, Machado DJ, Reid R, Brouwer C, Kofsky J, Janies DA. Twinkle twinkle brittle star: the draft genome of Ophioderma brevispinum (Echinodermata: ophiuroidea) as a resource for regeneration research. BMC Genomics. 2022;23(1):574. doi:10.1186/s12864-022-08750-y
- RepeatModeler2 publication: Flynn JM, Hubley R, Goubert C, et al. RepeatModeler2 for automated genomic discovery of transposable element families. Proceedings of the National Academy of Sciences. 2020;117(17):9451-9457. doi:10.1073/pnas.1921046117
Download site: https://hgdownload.soe.ucsc.edu/downloads.html
UCSC Servers (US server URLs provided; see UCSC Genome Browser's documentation for URLs for servers in Europe and elsewhere)
- HTTP server: `hgdownload.soe.ucsc.edu/`
- FTP server: `ftp://hgdownload.soe.ucsc.edu`
- MariaDB (MySQL) server: `genome-mysql.soe.ucsc.edu`
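Annotation tables can be queried directly from the public MariaDB server (read-only access with user `genome` and no password, per UCSC's documentation):

```bash
# List the 5 largest sequences in hg38; -A skips the slow tab-completion
# scan the mysql client otherwise performs on these large databases.
mysql --user=genome --host=genome-mysql.soe.ucsc.edu -A -D hg38 \
    -e 'SELECT chrom, size FROM chromInfo ORDER BY size DESC LIMIT 5;'
```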
Organization of directories accessible on HTTP and FTP servers
- `goldenPath` (`https://hgdownload.soe.ucsc.edu/goldenPath/<genome>` or `ftp://hgdownload.soe.ucsc.edu/goldenPath/<genome>`): UCSC genome annotations [UCSC Genome Browser User Guide]
  - Files in this directory are largely descriptive (showing where things are along the genome) rather than numeric.
    - Exceptions: conservation scores (under the `phyloP[#]way` and `phastCons[#]way` directories)
  - bigZips/: genome sequence, selected annotation files, and updates
    - "Files in this directory reflect the initial... release of the genome, the most current versions are in the "latest/" subdirectory"
    - RepeatMasker-masked genome FASTA files
    - `<genome>.chrom.sizes`
    - `<genome>.chromAlias.txt`
    - (Updated regularly) RefSeq mRNA multi-FASTA files
    - (Updated regularly) upstream1000/2000/5000: sequences 1000/2000/5000 bases upstream of annotated TSSs of RefSeq genes with annotated 5' UTRs
  - chromosomes/: a FASTA file for each chromosome/scaffold from the initial genome assembly release (i.e., without any patches)
  - database/: annotation tables, where each table is represented by a `.sql` file containing the SQL commands used to create the table and a `.txt.gz` file of the table data in tab-delimited format. Schema descriptions can be found by selecting the relevant dataset and clicking the "Data format description" button in the Table Browser.
    - 1 table from the RepeatMasker track: `rmsk`
      - rmsk.txt.gz: appears to be the main annotation file to use (e.g., as used in the Skipper pipeline; also used by the UCSCRepeatMasker Bioconductor AnnotationHub, see source code `./inst/scripts/make-data_UCSCRepeatMasker.R`); source unclear (not clearly described in the GitHub repo). See the sketch after this list for downloading it.
    - 6 tables from the RepeatMasker Viz. track: `rmsk[Align|Out|Joined][Baseline|Current]`
      - rmskAlign* and rmskOut*: very similar, except that Align corresponds to the ".align" file generated by RepeatMasker, which shows the alignments between the repeat and query sequences and is missing from the ".out" file.
      - rmsk*Baseline vs. rmsk*Current: based on the GitHub repo, these appear to correspond to older vs. newer annotations
      - rmskJoined*: unclear
    - chromAlias.txt.gz
    - ... many others ...
- `gbdb` (`https://hgdownload.soe.ucsc.edu/gbdb/` or `ftp://hgdownload.soe.ucsc.edu/gbdb/`): bigBeds/bigWigs/BAMs and other binary files [UCSC Genome Browser Blog]
  - This includes essentially all functional genomics data, such as expression (e.g., RNA-seq data from the FANTOM and GTEx projects), transcription factor binding (e.g., ChIP-seq data from ENCODE and ReMap), and chromatin accessibility (e.g., DNase HS signal from ENCODE).
  - Integration of 3rd-party data: TCGA, variant effect prediction scores (e.g., from CADD), aberrant splicing scores (e.g., from AbSplice)
  - Other numeric tracks: GC content
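For instance, a sketch for grabbing the `rmsk` table and its schema from the `database/` directory described above:

```bash
# Each table ships as a .sql schema plus a .txt.gz data dump; the CREATE
# TABLE statement gives the column order for the tab-delimited data file.
base=https://hgdownload.soe.ucsc.edu/goldenPath/hg38/database
curl -s "$base/rmsk.sql" | grep -A 20 'CREATE TABLE'   # column definitions
curl -s "$base/rmsk.txt.gz" | zcat | head -n 3         # first data rows
```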
Provenance of files: https://github.com/ucscGenomeBrowser/kent/tree/master/src/hg/makeDb/doc
- Example: code for main Human GRCh38 annotations = https://github.com/ucscGenomeBrowser/kent/blob/master/src/hg/makeDb/doc/hg38/hg38.txt
Abbreviations
- NCI: National Cancer Institute
- GDC: Genomic Data Commons
Documentation: https://docs.gdc.cancer.gov/
Available fields: https://docs.gdc.cancer.gov/API/Users_Guide/Appendix_A_Available_Fields/
- Definitions of each field can be found by searching the last component of the field name (i.e., after the last period in the field name) in the GDC Data Dictionary.
Abbreviations
- TSS: Tissue Source Site
- BCR: Biospecimen Core Resource
Format: [project]-[TSS]-[participant]-[sample][vial]-[portion][analyte]-[plate]-[center]
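A small bash sketch of splitting a barcode into these components (the barcode value is illustrative):

```bash
# Fields are hyphen-delimited; sample+vial and portion+analyte are
# concatenated within their fields (e.g., "01C" = sample 01, vial C).
barcode="TCGA-02-0001-01C-01D-0182-01"   # illustrative barcode
IFS=- read -r project tss participant sample_vial portion_analyte plate center <<< "$barcode"
echo "project=$project TSS=$tss participant=$participant sample+vial=$sample_vial"
echo "portion+analyte=$portion_analyte plate=$plate center=$center"
```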
When retrieving data via the GDC API, the TCGA Barcode can be found (if applicable) under the cases.samples.submitter_id field, as shown below.
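For example, a request against the GDC files endpoint (parameter names follow the GDC API documentation):

```bash
# Ask the GDC API for a few files, including the nested TCGA barcode field.
curl -sG 'https://api.gdc.cancer.gov/files' \
    --data-urlencode 'fields=file_name,cases.samples.submitter_id' \
    --data-urlencode 'size=3' \
    --data-urlencode 'pretty=true'
```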
Reference
- GDC Documentation: https://docs.gdc.cancer.gov/Encyclopedia/pages/TCGA_Barcode/
- TCGA Code Tables: https://gdc.cancer.gov/resources-tcga-users/tcga-code-tables
- Wikipedia: https://en.wikipedia.org/wiki/The_Cancer_Genome_Atlas
E-Direct command line tutorials
- NCBI Workshop: Accessing NCBI Biology Resources Using EDirect for Command Line Novices
- Jupyter Notebook: https://github.com/esallychang/CommandLine_EDirect_July2023
- NCBI Workshop: Downloading NCBI Biological Data and Creating Custom Reports Using the Command Line
- Jupyter Notebook: https://github.com/esallychang/CommandLine_CustomData_April2023
Other references
- NCBI C++ Toolkit documentation: https://ncbi.github.io/cxx-toolkit/
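A minimal EDirect sketch connecting these tutorials to the accession mapping below (assumes EDirect is installed; `runinfo` is one of `efetch`'s documented formats for the SRA database):

```bash
# Find SRA runs under an ENCODE BioProject and print the first few run
# accessions (first column of the runinfo CSV; line 1 is a header).
esearch -db sra -query 'PRJNA63443[BioProject]' |
    efetch -format runinfo |
    cut -d, -f1 | head -n 5
```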
| NCBI Database | Study / Project | Biological Sample | Experiment / Technical Replicate | Run |
|---|---|---|---|---|
| BioProject / BioSample | BioProject (`PRJNA`) | BioSample (`SAMN`) | | |
| Sequence Read Archive | SRA Study (`SRP`) | SRA Sample (`SRS`) | SRA Experiment (`SRX`) | SRA Run (`SRR`) |
| Gene Expression Omnibus Database | GEO Series (`GSE`) | | GEO Sample (`GSM`) | |
- Columns: accession types in the same column usually map 1:1, except in the Study / Project column:
- "The BioProject database defines two types of projects: 1) primary submission projects... are directly associated with submitted data... 2) umbrella projects, which reflect a higher-level organizational structure for larger initiatives or provide an additional level of data tracking." [NCBI Handbook] Therefore, one BioProject can "contain" another BioProject.
- For example, ENCODE (which also has its own dedicated GEO listing page) has multiple layers of umbrella BioProjects: The human ENCODE (ENCyclopedia Of DNA Elements) project (PRJNA30707)
- Pilot projects for the human ENCODE project (PRJNA13681)
- 14 sub-projects ...
- Production projects for the human ENCODE project (PRJNA63441)
- Production ENCODE epigenomic data (PRJNA63443)
- Homo sapiens Epigenomics (PRJNA292727)
- Production ENCODE functional genomics data (PRJNA63447)
- Production ENCODE transcriptome data (PRJNA30709)
- Each BioProject and each SRA Study may be associated with multiple GEO Series.
- Clearly, an umbrella BioProject may be associated with multiple SRA Studies (i.e., 1:many mapping). However, can a primary submission BioProject be associated with multiple SRA Studies?
- A single SRA Study can be associated with many GEO Series. For example, the ENCODE SRA Study SRP012412 is associated with GEO Series GSE177866 and GSE231300, among many others.
- "The BioProject database defines two types of projects: 1) primary submission projects... are directly associated with submitted data... 2) umbrella projects, which reflect a higher-level organizational structure for larger initiatives or provide an additional level of data tracking." [NCBI Handbook] Therefore, one BioProject can "contain" another BioProject.
- Rows: The relationship is usually 1:many going across a row, left-to-right. However, many:many relationships also exist.
- References
- Relationship between SRA, BioProject, and BioSample: https://www.ncbi.nlm.nih.gov/sra/docs/submitmeta/, https://www.ncbi.nlm.nih.gov/sra/docs/submitquestions/
Consider the ENCODE Mint-ChIP-seq experiment ENCSR928PSU, which generated 2 technical replicates. Each technical replicate is associated with many read (FASTQ) files, not all of which have the same read length. Below, I show the accessions associated with technical replicate 2.
| NCBI Database | Study / Project | Biological Sample | Experiment / Technical Replicate | Run / Library |
|---|---|---|---|---|
| BioProject / BioSample | PRJNA63443 | SAMN19597277 | | |
| Sequence Read Archive | SRP012412 | SRS9223741 | SRX11165854 | 8 runs: SRR14842522, ..., SRR14842529 |
| Gene Expression Omnibus Database | GSE177866 | | GSM5379091 | |
| ENCODE | | ENCBS832HZD | ENCLB042NQH | (ENCFF089NQO (R1), ENCFF521RSO (R2)), ..., (ENCFF785QUX (R1), ENCFF204YZK (R2)) |
Notes
- Not all accessions have a useful dedicated page/interface and therefore are not hyperlinked above.
- Example: Searching for an SRA Sample accession simply returns a list of associated SRA Experiments (e.g., SRS9223741).
- Example: ENCODE pages for libraries from technical replicates (e.g., ENCLB042NQH) are merely JSON dumps; instead, technical replicates are best viewed from the experiment page (e.g., ENCSR928PSU).
- The same SRA sample accession SRS9223741 is associated with 2 SRA experiments: SRX11165854 (technical replicate 2: see GSM5379091) and SRX11165855 (technical replicate 1: see GSM5379092). This demonstrates how the term "sample" is used differently by SRA / BioSample (referring to a biological sample) vs. GEO (referring to a technical replicate or library).
- The concept of an ENCODE experiment accession (which may encompass multiple biological and technical replicates) does not appear to neatly correspond to any NCBI accession.
NCBI's All Resources page splits GEO into 3 components:
- GEO Database: main data repository, with GEO Samples organized into GEO Series.
- GEO DataSets: "a curated collection of biologically and statistically comparable GEO Samples" [GEO Overview]; accession prefix =
GDS - GEO Profiles: derived from GEO DataSets
Each of the 3 components has its own interface, but the GEO Database interface appears more limited and, as of 2025-12-03, is not selectable from the database/resource dropdown next to the search bar at the top of most NCBI pages. Instead, both Series and DataSets are searchable using the GEO DataSets interface.
Accession prefixes (see https://www.ncbi.nlm.nih.gov/sra/docs/submitmeta/)
- STUDY: `SRP#`
- SAMPLE (`SRS#`): can be shared between STUDYs and between EXPERIMENTs.
- EXPERIMENT (`SRX#`): main publishable unit in the SRA database
  - Each EXPERIMENT represents a combination of biological replicate, library, sequencing strategy (e.g., targeted selection vs. unbiased), layout (e.g., paired end vs. single end), and instrument model.
- RUN (`SRR#`): a "RUN is simply a manifest of data file(s) that are derived from sequencing a library described by the associated EXPERIMENT."
  - "All data files listed in a RUN will be merged into a single *.sra* archive file."
- SUBMISSION (`SRA#`; non-public accession)
SRA data formats
- References
- Overview: SRA Documentation
- Storage: SRA Archive Documentation, SRA Data Working Group 2021 report, and NLM Support (via email on 2024-08-08)
- SRA Normalized Format (`*.sra`; aka extract-transform-load or ETL format): contains base calls, full base quality scores, and alignments
  - Discards original read names [SRA Toolkit GitHub]
  - Storage: AWS (hot) via AWS Open Data Program --> free egress worldwide with anonymous identity
- SRA Lite (`*.sralite`; aka ETL-BQS for ETL format without base quality scores): contains base calls, a per-read quality flag, and alignments
  - Discards original read names [SRA Toolkit GitHub]
  - The per-read quality flag (`Read_Filter`) is either `pass` or `reject`. See SRA Documentation for how the SRA determines whether a read passes the read filter.
  - Storage
    - NCBI servers: free egress worldwide with anonymous identity
    - Cloud (AWS and GCP): hot; free egress to cloud services in the same geographical region
- Originally submitted source files
  - Storage: AWS, mostly cold storage
Accessing SRA data
- Web interface
  - Run Browser: Search by SRA Run Accession (`SRR#`) to see metadata, taxonomy analysis, read sequences, and data access information about the run, as well as a tool to download FASTA/FASTQ files for runs in the same SRA Experiment.
    - The Data access tab indicates where the data is stored (NCBI, AWS, or GCP servers) and what types of egress are free.
    - The FASTA/FASTQ download web interface only allows a limited download of < 5 Gb of sequence over HTTP. [SRA Documentation]
  - Run Selector: Search by SRA, BioProject, BioSample, or GEO accessions to see all associated SRA Runs. Offers an interface to download metadata or retrieve the data from cold cloud storage to a cloud bucket.
- AWS
  - Buckets available through the Registry of Open Data (`s3://sra-pub-src-1/`, `s3://sra-pub-src-2/`, and `s3://sra-pub-run-odp/`; see NIH NCBI Sequence Read Archive (SRA) on AWS) contain "hot" data that is free to download anonymously. Those buckets can be browsed anonymously using commands like
    ```bash
    aws s3 ls s3://sra-pub-src-1/ --no-sign-request
    ```
    and files can be downloaded directly via HTTP.
    - Example: Consider SRA Run DRR000110. The SRA Run Browser shows that the original FASTQ files are hosted in the S3 bucket `sra-pub-src-1` and available for anonymous, free egress worldwide. One can list the raw files associated with that run via
      ```bash
      aws s3 ls s3://sra-pub-src-1/DRR000110/ --no-sign-request
      ```
      and download them via
      ```bash
      aws s3 cp s3://sra-pub-src-1/DRR000110 . --no-sign-request --recursive
      ```
      which copies the raw files `090324_30WB8AAXX_s_3_sequence.txt.tar.gz.1` and `090324_30WB8AAXX_s_4_sequence.txt.tar.gz.1` into the current working directory. Extracting those archives yields two FASTQ files, `s_3_sequence.txt` and `s_4_sequence.txt`.
      - As shown in the SRA Run Browser, the raw TAR archives can also be directly downloaded from their S3 buckets via HTTP.
      - Instead of using the AWS CLI, the SRA Toolkit also supports downloading the raw data directly via the `--type` argument. See below.
  - All other buckets that are shown in the Run Browser for any SRA Run appear to either be region-specific in their free egress support or host data in cold storage.
- Downloading data from cold storage: use the "Create a Data Delivery order" page to retrieve the data into a user's cloud storage bucket. This will incur cloud storage costs for the user. The data can then be retrieved from the user's cloud storage bucket, potentially incurring additional costs.
  - Follow the instructions on the "Create a Data Delivery order" page to adjust bucket permissions. I successfully retrieved data using the following permissions settings with a new S3 bucket:
    - Do not block public access (uncheck all "block public access" or "block all public access" boxes)
    - Copy the automatically generated bucket policy (JSON text) from the "Create a Data Delivery order" page to the "Bucket policy" section of the Permissions tab of the bucket.
  - Upon retrieval (which may take up to 48 hours), a metadata CSV file is deposited into the target bucket along with a folder (named with the SRA Run accession) containing the requested data.
  - If logged into your MyNCBI account, the "Create a Data Delivery order" page will show the status of data delivery orders from the last 30 days.
- Download data from hot storage: download using the SRA Toolkit or using cloud APIs
  - Example: SRA Run DRR310659, which is available in SRA Normalized Format on GCP at `gs://sra-pub-run-110/DRR310659/DRR310659.1` with free egress to `gs.us-east1`. (It is also available with free egress worldwide from NCBI servers, but for the sake of this example, we restrict ourselves to downloading from GCP.)
    - To download using the SRA Toolkit specifically from the GCP bucket (as opposed to other servers):
      ```bash
      fasterq-dump --location gs://sra-pub-run-110/DRR310659/DRR310659.1 DRR310659
      ```
      - The `--location` argument is explained in the help of `fasterq-dump` version 3.0.0 but not the latest SRA Toolkit version 3.1.1.
    - To download using the `gcloud` CLI:
      ```bash
      gcloud storage --billing-project=<billing-project> cp gs://sra-pub-run-110/DRR310659/DRR310659.1 .
      ```
      where `<billing-project>` is a project ID shown under the "ID" column at https://console.cloud.google.com/billing/projects. This downloads an SRA Normalized Format file that can be converted to FASTQ and other formats via the SRA Toolkit programs, such as
      ```bash
      fasterq-dump ./DRR310659
      ```
    - Costs: Presumably, if these commands are run within a Google Cloud virtual machine (Cloud Shell or Compute Engine instance) located in a `us-east1` region, then the download is free. However, if downloading to local premises, an egress cost may be incurred.
  - If originally submitted source files are available in hot storage, they can be downloaded directly using the SRA Toolkit via the `--type` argument.
    - Example: The Run Browser for SRA Run DRR000110 lists the original format files as type `fastq`. We already explored above how to download the raw files via HTTP or the AWS CLI. To download using the SRA Toolkit, run
      ```bash
      prefetch --type fastq DRR000110
      ```
      which will create a folder DRR000110 with the 2 raw data files inside: `090324_30WB8AAXX_s_3_sequence.txt.tar.gz` and `090324_30WB8AAXX_s_4_sequence.txt.tar.gz`.
Documentation: https://github.com/ncbi/sra-tools/wiki
- Note (as of 2024-08-07): Because there are a lot of Wiki pages, some of them are initially hidden. Click on "Show 11 more pages..." to see all of them.
Building from source: see ncbi/sra-tools#937 (comment)
- Using CMake:

  ```bash
  # set where to install the sratoolkit
  DIR_INSTALL="$HOME/local/sratoolkit"
  # set where to download source code and create build directory
  DIR_TMP="$HOME/tmp/scratch/sratoolkit_build"
  cd "$DIR_TMP"
  git clone https://github.com/ncbi/ncbi-vdb.git
  git clone https://github.com/ncbi/sra-tools.git
  mkdir build
  cd build
  cmake -S "$(cd ../ncbi-vdb; pwd)" -B ncbi-vdb
  cmake --build ncbi-vdb
  cmake -D VDB_LIBDIR="${PWD}/ncbi-vdb/lib" -D CMAKE_INSTALL_PREFIX="$DIR_INSTALL" -S "$(cd ../sra-tools; pwd)" -B sra-tools
  cmake --build sra-tools --target install
  # binaries are installed to "$DIR_INSTALL/bin/"
  ```
Using Autoconf
DIR_WD="$(pwd -P)" mkdir sra_install mkdir sra_build mkdir sra_src cd sra_src git clone https://github.com/ncbi/ncbi-vdb.git git clone https://github.com/ncbi/sra-tools.git cd ncbi-vdb ./configure --build-prefix="$DIR_WD/sra_build" --prefix="$DIR_WD/sra_install" make make install cd ../sra-tools ./configure --build-prefix="$DIR_WD/sra_build" --prefix="$DIR_WD/sra_install" make make install # binaries are now available at "$DIR_WD/bin"
There are 2 basic ways to download data from the SRA with the SRA Toolkit:
- Prefetch and then extract to desired data type
- Tutorial: https://github.com/ncbi/sra-tools/wiki/08.-prefetch-and-fasterq-dump
  - For in-depth documentation for the `fasterq-dump` tool, see the page "HowTo: fasterq dump".
- On demand
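A sketch of both patterns for a public run (the accession is taken from the ENCODE example earlier; output paths depend on your vdb-config settings):

```bash
# Pattern 1: prefetch the .sra archive, then extract FASTQ from the local
# copy; prefetch can be re-run to resume an interrupted download.
prefetch SRR14842522                 # add --max-size for runs > 20 GB
fasterq-dump SRR14842522             # finds the prefetched ./SRR14842522/

# Pattern 2 ("on demand"): fasterq-dump downloads and extracts in one step.
fasterq-dump SRR14842529
```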
While prefetch and fasterq-dump are the main programs, the SRA Toolkit comes with all of the following tools in the bin directory where it is installed. Some useful commands are indicated below.
- `abi-dump`
- `align-info`
- `cache-mgr`
- `check-corrupt`
- `fasterq-dump`
- `fastq-dump`
- `illumina-dump`
- `kdbmeta`
- `ngs-pileup`
- `prefetch`
  - Use the `--max-size` argument to download more than 20 GB of data.
- `rcexplain`
- `ref-variation`
- `sam-dump`
- `sff-dump`
- `sra-info`
- `srapath`
- `sra-pileup`
- `sra-search`
- `sra-stat`
- `sratools`
- `test-sra`
- `var-expand`
- `vdb-config`
  - The full configuration of the toolkit can be viewed by running `vdb-config`. It appears that the interactive configuration settings from running `vdb-config -i` are saved to `~/.ncbi/user-settings.mkfg`.
  - The interactive form of `vdb-config` does not expose all settings, some of which can only be set via the command line. See https://github.com/ncbi/sra-tools/wiki/06.-Connection-Timeouts.
- `vdb-decrypt`
- `vdb-dump`
  - `vdb-dump --info <accession>`: show the size (in bytes) of the accession, among other information
- `vdb-encrypt`
- `vdb-validate`