seqspec is a machine-readable YAML file format for genomic library sequence and structure. It was inspired by and builds off of the Teichmann Lab Single Cell Genomics Library Structure by Xi Chen.
A list of seqspec examples for multiple assays can be found in the assays/ folder. Each spec.yaml describes the 5'->3' "Final library structure" for the assay. Sequence specification files can be formatted with the seqspec command line tool.
# development
pip install git+https://github.com/IGVF/seqspec.git
# released
pip install seqspec
seqspec format --help

Each assay is described by two objects: the Assay object and the Region object. A library is described by one Assay object and multiple (possibly nested) Region objects. The Region objects are grouped with a join operation, and an order is specified on the subRegions. A simple (but not fully specified) example looks like the following:
modalities:
  - Modality1
  - Modality2
assay_spec:
  - region_id: Modality1
    regions:
      - region_id: Region1
        ...
      - region_id: Region2
        ...
  - region_id: Modality2
    ...
In order to catalogue relevant information for each library structure, multiple properties are specified for each Assay and each Region. A description of the Assay and Region schema can be found in seqspec/schema/seqspec.schema.json.
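The nesting described above can be walked recursively. The following is a minimal sketch that uses plain Python dictionaries as a hand-written stand-in for a parsed spec; the helper names `iter_regions` and `duplicate_region_ids` are hypothetical and are not part of the seqspec package. It lists every region in order and reports any `region_id` that appears more than once.

```python
from collections import Counter

# Hypothetical stand-in for a parsed spec (mirrors the simple example above);
# this is not a real assay and not the seqspec object model.
assay_spec = [
    {
        "region_id": "Modality1",
        "regions": [
            {"region_id": "Region1", "regions": None},
            {"region_id": "Region2", "regions": None},
        ],
    },
    {"region_id": "Modality2", "regions": None},
]


def iter_regions(regions, depth=0):
    """Yield (depth, region) for every region, parents before children."""
    for region in regions or []:
        yield depth, region
        yield from iter_regions(region.get("regions"), depth + 1)


def duplicate_region_ids(regions):
    """Return the sorted region_ids that occur more than once."""
    counts = Counter(region["region_id"] for _, region in iter_regions(regions))
    return sorted(rid for rid, n in counts.items() if n > 1)


# Print the hierarchy with two spaces of indentation per nesting level.
for depth, region in iter_regions(assay_spec):
    print("  " * depth + region["region_id"])

print(duplicate_region_ids(assay_spec))
```

Treating a `null` (None) `regions` value the same as an empty list keeps the traversal free of special cases for leaf regions.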
Below is an example of an Assay.
!Assay
name: SPLiT-seq
doi: https://doi.org/10.1126/science.aam8999
publication_date: 15 March 2018
description: split-pool ligation-based transcriptome sequencing
modalities:
- RNA
lib_struct: https://teichlab.github.io/scg_lib_structs/methods_html/SPLiT-seq.html
assay_spec:

- name is a free-form string that labels the assay
- doi is the doi link to the paper/protocol that describes the assay (if it exists)
- publication_date is the date the assay was published (linked to by the doi). Must be in DD Month Year format.
- description is a free-form string that describes the assay
- modalities is a list of region_types that are contained within the library. Each string must be present in exactly one Region in the first "level" of the assay_spec.
- lib_struct is a link to the manually annotated library structure developed by Xi Chen in Sarah Teichmann's lab.
- assay_spec is a list of Regions.
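Two of the Assay rules above lend themselves to a quick check: the `publication_date` format and the requirement that each modality appears in exactly one first-level Region. The sketch below is illustrative only. The function names are hypothetical, regions are represented as plain dictionaries, and it assumes that a modality is matched against the `region_type` of the first-level regions.

```python
from datetime import datetime


def check_publication_date(text):
    """Return True if text parses as 'DD Month Year', e.g. '15 March 2018'.

    Note: %B uses the month names of the current locale (English by default),
    and %d also accepts a single-digit day such as '1 January 2001'.
    """
    try:
        datetime.strptime(text, "%d %B %Y")
    except ValueError:
        return False
    return True


def modality_errors(modalities, assay_spec):
    """Report modalities not found in exactly one first-level region.

    Assumption: modalities are compared with each region's region_type.
    """
    first_level = [region.get("region_type") for region in assay_spec]
    errors = []
    for modality in modalities:
        n = first_level.count(modality)
        if n != 1:
            errors.append(f"{modality}: found in {n} first-level regions, expected 1")
    return errors


print(check_publication_date("15 March 2018"))
print(check_publication_date("2018-03-15"))
print(modality_errors(["RNA"], [{"region_id": "RNA", "region_type": "RNA"}]))
```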
Below is an example of a Region.
!Region
region_id: barcode-1
region_type: barcode
name: barcode-1
sequence_type: onlist
sequence: NNNNNNNN
min_len: 8
max_len: 8
onlist: !Onlist
  filename: barcode-1_onlist.txt
  md5: null
regions: null

- region_id is a free-form string and must be unique across all regions in the seqspec file.
  - if the assay contains multiple regions of the same region_type it may be useful to append an integer to the end of the region_id to differentiate those regions. For example, if the assay had four barcodes then each of the individual barcode regions could have the region_ids barcode-1, barcode-2, barcode-3, barcode-4.
- region_type can be one of the following:
  - RNA
  - ATAC
  - CRISPR
  - Protein
  - illumina_p5
  - illumina_p7
  - nextera_read1
  - nextera_read2
  - s5
  - s7
  - ME1
  - ME2
  - truseq_read1
  - truseq_read2
  - index5
  - index7
  - fastq
  - barcode
  - umi
  - cDNA
  - gDNA
- name is a free-form string for describing the region
- sequence_type can be one of the following:
  - fixed indicates that the sequence string is known
  - joined indicates that the sequence is created (joined) from nested regions
  - onlist indicates that the sequence is derived from an onlist (if specified, then onlist must be non-null)
  - random indicates that the sequence is not known a priori
- sequence is a representation of the sequence
  - if the sequence_type is fixed then the actual sequence string is provided
  - if the sequence_type is joined then the field must be the concatenation of the nested regions
  - if the sequence_type is onlist then the field must be an N string with the length of the shortest sequence on the onlist
  - if the sequence_type is random then the field must be an X
- min_len is an integer greater than or equal to zero. It represents the minimum possible length of the sequence
- max_len is an integer greater than or equal to the min_len. It represents the maximum length of the sequence
- onlist can be null or contain
  - filename, which is a path (relative to the seqspec file) to a file containing a list of sequences
  - md5, which is the md5sum of the uncompressed file in filename
- regions can either be null or contain a list of regions as specified above.
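The field rules above can be expressed as a small checker. This is a sketch under stated assumptions rather than the validation that seqspec itself performs: the function name `region_errors` is hypothetical, regions are plain dictionaries, the check is not recursive, and the onlist length rule is skipped because it requires reading the onlist file.

```python
def region_errors(region):
    """Return a list of rule violations for one region dict (not recursive)."""
    errors = []
    seq_type = region.get("sequence_type")
    seq = region.get("sequence") or ""
    min_len = region.get("min_len")
    max_len = region.get("max_len")
    children = region.get("regions") or []

    # min_len >= 0 and max_len >= min_len
    if not isinstance(min_len, int) or min_len < 0:
        errors.append("min_len must be an integer >= 0")
    elif not isinstance(max_len, int) or max_len < min_len:
        errors.append("max_len must be an integer >= min_len")

    if seq_type == "fixed":
        if not seq:
            errors.append("fixed: sequence must be provided")
    elif seq_type == "joined":
        # The sequence must be the nested sequences concatenated in order.
        joined = "".join(child.get("sequence") or "" for child in children)
        if seq != joined:
            errors.append("joined: sequence must equal the concatenated nested sequences")
    elif seq_type == "onlist":
        if region.get("onlist") is None:
            errors.append("onlist: the onlist field must be non-null")
        if not seq or set(seq) != {"N"}:
            errors.append("onlist: sequence must be a string of N")
    elif seq_type == "random":
        if seq != "X":
            errors.append("random: sequence must be X")
    else:
        errors.append(f"unknown sequence_type: {seq_type!r}")
    return errors


# Example: a joined region built from two fixed regions.
s5 = {"region_id": "s5", "sequence_type": "fixed", "sequence": "TCGTCGGCAGCGTC",
      "min_len": 14, "max_len": 14, "regions": None}
me1 = {"region_id": "ME1", "sequence_type": "fixed", "sequence": "AGATGTGTATAAGAGACAG",
       "min_len": 19, "max_len": 19, "regions": None}
nextera_read1 = {"region_id": "nextera_read1", "sequence_type": "joined",
                 "sequence": s5["sequence"] + me1["sequence"],
                 "min_len": 33, "max_len": 33, "regions": [s5, me1]}

print(region_errors(nextera_read1))
```

Returning a list of messages instead of raising on the first problem makes it possible to report every violation in a region at once.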
For more information about the specification of the various fields, please see seqspec.schema.json which is the JSON schema representation of the various fields described above.
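Because the md5 field of an onlist refers to the uncompressed file, a gzipped onlist has to be decompressed before hashing. The sketch below uses only the Python standard library. The function name `onlist_md5` is hypothetical, and detecting compression from a `.gz` suffix is an assumption made for illustration, not something the specification prescribes.

```python
import gzip
import hashlib
import sys


def onlist_md5(path, chunk_size=1 << 20):
    """Return the md5 hex digest of the uncompressed contents of path.

    Assumption: a file whose name ends in '.gz' is gzip-compressed;
    any other file is read as-is.
    """
    opener = gzip.open if str(path).endswith(".gz") else open
    digest = hashlib.md5()
    with opener(path, "rb") as handle:
        # Read in chunks so large onlists are not loaded into memory at once.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


if __name__ == "__main__":
    # Usage: python onlist_md5.py barcode-1_onlist.txt
    for path in sys.argv[1:]:
        print(path, onlist_md5(path))
```

With this approach a plain onlist and its gzipped copy produce the same digest, so the md5 value in the spec does not change if the file is later compressed.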
The YAML file contains tags (strings prepended with an exclamation point !) to describe the various objects (Assay, Region, Onlist). The purpose of these tags is to make it easy to load the seqspec into python as a python object. This makes it possible to access the various attributes of the seqspec file with "dot notation" as follows:
from seqspec.utils import load_spec
spec = load_spec("seqspec/assays/10x-RNA-v3/spec.yaml")
print(spec.get_modality("RNA").sequence)
# AATGATACGGCGACCACCGAGATCTACACTCTTTCCCTACACGACGCTCTTCCGATCTNNNNNNNNNNNNNNNNNNNNNNNNNNNNXAGATCGGAAGAGCACACGTCTGAACTCCAGTCACNNNNNNNNATCTCGTATGCCGTCTTCTGCTTG

For consistency across assays I suggest the following naming conventions for standard regions. Note that the region_id for all atomic regions should be unique.
# Assay region
!Assay
name: My-RNA-Assay
doi: mydoi.org
publication_date: 01 January 2001
description: My custom assay
modalities:
- RNA
lib_struct: www.link-to-libstructs.com
assay_spec:
- !Region
region_id: RNA
region_type: RNA
name: My RNA
sequence_type: joined
sequence:
min_len: 0
max_len: 0
onlist:
regions:
# illumina_p5
- !Region
region_id: illumina_p5
region_type: illumina_p5
name: illumina_p5
sequence_type: fixed
sequence: AATGATACGGCGACCACCGAGATCTACAC
min_len: 29
max_len: 29
onlist:
regions:
# illumina_p7
- !Region
region_id: illumina_p7
region_type: illumina_p7
name: illumina_p7
sequence_type: fixed
sequence: ATCTCGTATGCCGTCTTCTGCTTG
min_len: 24
max_len: 24
onlist:
regions:
# nextera_read1
- !Region
region_id: nextera_read1
region_type: nextera_read1
name: nextera_read1
sequence_type: joined
sequence: TCGTCGGCAGCGTCAGATGTGTATAAGAGACAG
min_len: 33
max_len: 33
onlist:
regions:
- !Region
region_id: s5
region_type: s5
name: s5
sequence_type: fixed
sequence: TCGTCGGCAGCGTC
min_len: 14
max_len: 14
onlist:
regions:
- !Region
region_id: ME1
region_type: ME1
name: ME1
sequence_type: fixed
sequence: AGATGTGTATAAGAGACAG
min_len: 19
max_len: 19
onlist:
regions:
# nextera_read2
- !Region
region_id: nextera_read2
region_type: nextera_read2
name: nextera_read2
sequence_type: joined
sequence: CTGTCTCTTATACACATCTCCGAGCCCACGAGAC
min_len: 34
max_len: 34
onlist:
regions:
- !Region
region_id: ME2
region_type: ME2
name: ME2
sequence_type: fixed
sequence: CTGTCTCTTATACACATCT
min_len: 19
max_len: 19
onlist:
regions:
- !Region
region_id: s7
region_type: s7
name: s7
sequence_type: fixed
sequence: CCGAGCCCACGAGAC
min_len: 15
max_len: 15
onlist:
regions:
# truseq_read1
- !Region
region_id: truseq_read1
region_type: truseq_read1
name: truseq_read1
sequence_type: fixed
sequence: ACACTCTTTCCCTACACGACGCTCTTCCGATCT
min_len: 33
max_len: 33
onlist:
regions:
# truseq_read2
- !Region
region_id: truseq_read2
region_type: truseq_read2
name: truseq_read2
sequence_type: fixed
sequence: AGATCGGAAGAGCACACGTCTGAACTCCAGTCAC
min_len: 34
max_len: 34
onlist:
regions:
# index5
- !Region
region_id: I2.fastq.gz
region_type: I2.fastq.gz
name: Index 2 FASTQ
sequence_type: joined
sequence: NNNNNNNN
min_len: 8
max_len: 8
onlist:
regions:
- !Region
region_id: index5
region_type: index5
name: index5
sequence_type: onlist
sequence: NNNNNNNN
min_len: 8
max_len: 8
onlist: !Onlist
filename: index5_onlist.txt
md5: null
regions:
# index7
- !Region
region_id: I1.fastq.gz
region_type: I1.fastq.gz
name: Index 1 FASTQ
sequence_type: joined
sequence: NNNNNNNN
min_len: 8
max_len: 8
onlist:
regions:
- !Region
region_id: index7
region_type: index7
name: index7
sequence_type: onlist
sequence: NNNNNNNN
min_len: 8
max_len: 8
onlist: !Onlist
filename: index7_onlist.txt
md5: null
regions:
# Read 1 Fastq
- !Region
region_id: R1.fastq.gz
region_type: R1.fastq.gz
name: Read 1 FASTQ
sequence_type: joined
sequence:
min_len: 0
max_len: 0
onlist:
regions:
# Read 2 Fastq
- !Region
region_id: R2.fastq.gz
region_type: R2.fastq.gz
name: Read 2 FASTQ
sequence_type: joined
sequence:
min_len: 0
max_len: 0
onlist:
regions:
# barcode
# note for multiple of the same region
# the region id gets a number, i.e. barcode-1 barcode-2
- !Region
region_id: barcode
region_type: barcode
name: Barcode
sequence_type: onlist
sequence: NNNNNNNNNNNNNNNN
min_len: 16
max_len: 16
onlist: !Onlist
filename: barcode_onlist.txt
md5: null
regions:
# umi "Unique Molecular Identifier"
- !Region
region_id: umi
region_type: umi
name: Unique Molecular Identifier
sequence_type: random
sequence: NNNNNNNNNN
min_len: 10
max_len: 10
onlist:
regions:
# cDNA "complementary DNA"
- !Region
region_id: cDNA
region_type: cDNA
name: Complementary DNA
sequence_type: random
sequence: X
min_len: 1
max_len: 98
onlist:
regions:
# gDNA "genomic DNA"
- !Region
region_id: gDNA
region_type: gDNA
name: Genomic DNA
sequence_type: random
sequence: X
min_len: 1
max_len: 98
onlist:
regions:
# Regions corresponding to FASTQ files are annotated with a standard naming convention
# R1.fastq.gz "Read 1"
# R2.fastq.gz "Read 2"
# I1.fastq.gz "Index 1, i7 index"
# I2.fastq.gz "Index 2, i5 index"

Thank you for wanting to improve seqspec. If you have a bug that is related to seqspec please create an issue. The issue should contain
- the seqspec command that was run,
- the error message, and
- the seqspec and python version.
If you'd like to add assay sequence specifications or make modifications to the seqspec tool please do the following:
- Fork the project.
# Press "Fork" at the top right of the GitHub page
- Clone the fork and create a branch for your feature
git clone https://github.com/<USERNAME>/seqspec.git
cd seqspec
git checkout -b cool-new-feature
- Make changes, add files, and commit
# make changes, add files, and commit them
git add path/to/file1.yaml path/to/file2.yaml
git commit -m "I made these changes"
- Push changes to GitHub
git push origin cool-new-feature
- Submit a pull request
If you are unfamiliar with pull requests, you can find more information on the GitHub help page.