Welcome to the Chromium documentation
Chromium is a Snakemake workflow to process single-cell gene expression data from the 10x Genomics platform.
The workflow can be executed using the following command:
$ snakemake --cores all --use-conda
This will use all available cores and deploy software dependencies via the conda package manager. For further information, please refer to the official Snakemake documentation.
Use git to pull the latest release:
$ git pull
Chromium is configured by editing the files in the config directory:
- config.yaml
- samples.csv
- units.csv
An error will be thrown if these files are missing or do not contain the required information.
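The missing-file check can be pictured with a short Python sketch. This is not the workflow's own validation code: the function name `find_missing_config_files` is hypothetical, and only the three file names and the `config` directory are taken from this page.

```python
from pathlib import Path

# File names taken from the list above; everything else is illustrative.
REQUIRED_FILES = ("config.yaml", "samples.csv", "units.csv")


def find_missing_config_files(config_dir="config"):
    """Return the required configuration files that are absent from config_dir."""
    directory = Path(config_dir)
    return [name for name in REQUIRED_FILES if not (directory / name).is_file()]


if __name__ == "__main__":
    missing = find_missing_config_files()
    if missing:
        print("Missing configuration files:", ", ".join(missing))
    else:
        print("All configuration files are present.")
```

Running a check like this before starting the workflow reports every missing file at once instead of stopping at the first one.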
The workflow config is a YAML file containing information about the workflow parameters:
- each line contains a name:value pair
- each name:value pair corresponds to a workflow parameter
The workflow config must contain the following:
| Name | Description | Example |
|---|---|---|
| samples | Path to samples.csv | config/samples.csv |
| units | Path to units.csv | config/units.csv |
| source | Transcriptome source | Ensembl |
| organism | Species name | Homo sapiens |
| release | Release number | 101 |
| genome | Genome assembly | GRCh38.p13 |
| chemistry | Chemistry version | 10xv3 |
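As a rough illustration of the requirement above, the following Python sketch checks a config that has already been loaded into a dictionary. The function name `find_missing_parameters` is hypothetical and this is not how the workflow itself validates its config; only the parameter names come from the table.

```python
# Parameter names taken from the table above.
REQUIRED_PARAMETERS = (
    "samples",
    "units",
    "source",
    "organism",
    "release",
    "genome",
    "chemistry",
)


def find_missing_parameters(config):
    """Return the required parameter names that are absent or empty in config."""
    return [name for name in REQUIRED_PARAMETERS if config.get(name) in (None, "")]


# Values copied from the Example column of the table.
config = {
    "samples": "config/samples.csv",
    "units": "config/units.csv",
    "source": "Ensembl",
    "organism": "Homo sapiens",
    "release": "101",
    "genome": "GRCh38.p13",
    "chemistry": "10xv3",
}

print(find_missing_parameters(config))  # prints []
```

Removing any key from the dictionary makes the function return that key's name.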
Example of a valid workflow config:
samples: "config/samples.csv"
units: "config/units.csv"
source: "Ensembl"
organism: "Homo sapiens"
release: "101"
genome: "GRCh38.p13"
chemistry: "10xv3"
The samples table is a CSV file containing information about the biological samples in your experiment:
- each row corresponds to one sample
- each column corresponds to one attribute
For each sample, you must provide the following:
| Column | Description | Example |
|---|---|---|
| sample | Sample name | S1 |
Example of a valid samples table:
sample,condition
S1,control
S2,treatment
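A samples table in this shape can be read with Python's standard `csv` module. The sketch below is illustrative only: `read_samples` is a hypothetical helper, and the uniqueness check is an assumption drawn from the statement that each row corresponds to one sample.

```python
import csv
import io


def read_samples(handle):
    """Read a samples table and check the required 'sample' column."""
    reader = csv.DictReader(handle)
    if "sample" not in (reader.fieldnames or []):
        raise ValueError("samples table must contain a 'sample' column")
    rows = list(reader)
    names = [row["sample"] for row in rows]
    # Assumption: one row per sample means sample names should not repeat.
    if len(names) != len(set(names)):
        raise ValueError("sample names must be unique")
    return rows


# The example table from above, held in memory instead of a file.
table = "sample,condition\nS1,control\nS2,treatment\n"
rows = read_samples(io.StringIO(table))
print([row["sample"] for row in rows])  # prints ['S1', 'S2']
```

Extra columns such as `condition` are kept as they are, so further sample attributes can be added without changing the check.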
The units table is a CSV file containing information about the sequencing units in your experiment:
- each row corresponds to one unit
- each column corresponds to one attribute
For each unit, you must provide the following:
| Column | Description | Example |
|---|---|---|
| sample | Sample name | S1 |
| unit | Unit name | L001 |
| read1 | Read 1 file | S1_L001_1.fastq.gz |
| read2 | Read 2 file | S1_L001_2.fastq.gz |
Example of a valid units table:
sample,unit,read1,read2
S1,L001,S1_L001_1.fastq.gz,S1_L001_2.fastq.gz
S1,L002,S1_L002_1.fastq.gz,S1_L002_2.fastq.gz
S2,L001,S2_L001_1.fastq.gz,S2_L001_2.fastq.gz
S2,L002,S2_L002_1.fastq.gz,S2_L002_2.fastq.gz
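To see how units relate to samples, the sketch below groups the rows of a units table by sample name using only the Python standard library. The helper `group_units_by_sample` is hypothetical and is not part of the workflow; it only illustrates that several units can belong to one sample.

```python
import csv
import io
from collections import defaultdict

# Column names taken from the table above.
REQUIRED_COLUMNS = ("sample", "unit", "read1", "read2")


def group_units_by_sample(handle):
    """Map each sample name to the list of (read1, read2) pairs of its units."""
    reader = csv.DictReader(handle)
    missing = [c for c in REQUIRED_COLUMNS if c not in (reader.fieldnames or [])]
    if missing:
        raise ValueError("units table is missing columns: " + ", ".join(missing))
    grouped = defaultdict(list)
    for row in reader:
        grouped[row["sample"]].append((row["read1"], row["read2"]))
    return dict(grouped)


# The example units table from above, held in memory instead of a file.
table = """sample,unit,read1,read2
S1,L001,S1_L001_1.fastq.gz,S1_L001_2.fastq.gz
S1,L002,S1_L002_1.fastq.gz,S1_L002_2.fastq.gz
S2,L001,S2_L001_1.fastq.gz,S2_L001_2.fastq.gz
S2,L002,S2_L002_1.fastq.gz,S2_L002_2.fastq.gz
"""
units = group_units_by_sample(io.StringIO(table))
for sample, pairs in units.items():
    print(sample, len(pairs))
```

For the example table, each of the two samples ends up with two read pairs.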
Chromium processes the data using three different quantification methods:
- kallisto|bustools
- Alevin
- STARsolo
For each method, all output files are saved to disk and a SingleCellExperiment
object containing spliced and unspliced count matrices is generated.
Chromium saves all output files in the output directory. The files are organised
by software and labelled by sample name and data type:
output
│
├── busparse
│   ├── {genome}.cDNA_introns.fa
│   ├── {genome}.cDNA_tx_to_capture.txt
│   ├── {genome}.introns_tx_to_capture.txt
│   └── {genome}.tr2g.tsv
│
├── bustools
│   ├── {sample}.correct.bus
│   ├── {sample}.sort.bus
│   ├── {sample}.spliced.bus
│   ├── {sample}.unspliced.bus
│   ├── {sample}.spliced.mtx
│   ├── {sample}.spliced.barcodes.txt
│   ├── {sample}.spliced.genes.txt
│   ├── {sample}.unspliced.mtx
│   ├── {sample}.unspliced.barcodes.txt
│   └── {sample}.unspliced.genes.txt
│
├── eisar
│   ├── {genome}.annotation.gtf
│   ├── {genome}.fa
│   ├── {genome}.features.tsv
│   ├── {genome}.ranges.rds
│   └── {genome}.tx2gene.tsv
│
├── genomepy
│   ├── index
│   │   └── {genome}.idx
│   ├── {genome}.annotation.bed.gz
│   ├── {genome}.annotation.gtf.gz
│   ├── {genome}.fa
│   ├── {genome}.fa.fai
│   ├── {genome}.fa.sizes
│   └── {genome}.gaps.bed
│
├── gffread
│   ├── {genome}.id2name.tsv
│   ├── {genome}.mrna.txt
│   ├── {genome}.rrna.txt
│   └── {genome}.tx2gene.tsv
│
├── kallisto
│   ├── bus
│   │   └── {sample}
│   └── index
│       └── {genome}.idx
│
├── salmon
│   ├── alevin
│   │   └── {sample}
│   ├── genome
│   │   └── {genome}
│   └── index
│       └── {genome}
│
├── singlecellexperiment
│   ├── {sample}.kallisto.rds
│   ├── {sample}.salmon.rds
│   └── {sample}.star.rds
│
└── star
    ├── align
    │   └── {sample}
    └── index
        └── {genome}
For each rule
The busparse directory contains output files generated by the
get_velocity_files and read_velocity_files functions from the
BUSpaRse software:
| File | Format | Description |
|---|---|---|
| {genome}.cDNA_introns.fa | FASTA | Spliced transcript and intron sequences |
| {genome}.cDNA_tx_to_capture.txt | TXT | Transcript IDs of spliced transcripts |
| {genome}.introns_tx_to_capture.txt | TXT | Transcript IDs of introns |
| {genome}.tr2g.tsv | TSV | Transcript to gene ID table |
The bustools directory contains output files generated by the correct, sort, capture, and count commands of the BUStools software:
| File | Format | Description |
|---|---|---|
| {sample}.correct.bus | BUS | Corrected BUS file |
| {sample}.sort.bus | BUS | Sorted BUS file |
| {sample}.spliced.bus | BUS | Spliced BUS file |
| {sample}.unspliced.bus | BUS | Unspliced BUS file |
| {sample}.spliced.barcodes.txt | TXT | Spliced barcodes |
| {sample}.spliced.genes.txt | TXT | Spliced genes |
| {sample}.spliced.mtx | MTX | Spliced counts |
| {sample}.unspliced.barcodes.txt | TXT | Unspliced barcodes |
| {sample}.unspliced.genes.txt | TXT | Unspliced genes |
| {sample}.unspliced.mtx | MTX | Unspliced counts |
The eisar directory contains output files generated by the exportToGtf,
getFeatureRanges, and getTx2Gene functions from the eisaR software:
| File | Format | Description |
|---|---|---|
| {genome}.annotation.gtf | GTF | Expanded gene annotation |
| {genome}.fa | FASTA | Spliced transcript and intron sequences |
| {genome}.features.tsv | TSV | Spliced transcript and intron names |
| {genome}.ranges.rds | RDS | Spliced transcript and intron ranges |
| {genome}.tx2gene.tsv | TSV | Transcript to gene table |
The genomepy directory contains output files generated by the install
command from the genomepy software.
| File | Format | Description |
|---|---|---|
| {genome}.annotation.bed.gz | BED | Gene annotation |
| {genome}.annotation.gtf.gz | GTF | Gene annotation |
| {genome}.fa | FASTA | Genome sequence |
| {genome}.fa.sizes | TSV | Chromosome size |
| {genome}.gaps.bed | BED | Gap location |
| README.txt | TXT | README |
The gffread directory contains output files generated by the GffRead software:
| File | Format | Description |
|---|---|---|
| {genome}.id2name.tsv | TSV | The gene_id to gene_name annotation table |
| {genome}.mrna.txt | TXT | The mRNA gene_id annotation table |
| {genome}.rrna.txt | TXT | The rRNA gene_id annotation table |
| {genome}.tx2gene.tsv | TSV | The transcript_id to gene_id annotation table |
The kallisto directory contains output files generated by the index and bus commands of the Kallisto software:
| File | Format | Description |
|---|---|---|
| bus/{sample}/matrix.ec | MTX | Equivalence class |
| bus/{sample}/output.bus | BUS | Output |
| bus/{sample}/run_info.json | JSON | Run information |
| bus/{sample}/transcripts.txt | TXT | Transcript names |
| index/{genome}.idx | IDX | Kallisto index |
The salmon directory contains output files generated by the index and alevin commands from the Salmon software:
| File | Format | Description |
|---|---|---|
| alevin/{sample}/quants_mat.gz | TSV | Compressed count matrix |
| alevin/{sample}/quants_mat_cols.txt | TXT | Column header (gene_id) of the matrix |
| alevin/{sample}/quants_mat_rows.txt | TXT | Row index (CB-ids) of the matrix |
| alevin/{sample}/quants_tier_mat.gz | TSV | Tier categorization of the matrix |
| index/{genome} | DIR | Salmon index |
The singlecellexperiment directory contains SingleCellExperiment object files containing spliced and unspliced count matrices:
| File | Format | Description |
|---|---|---|
| {sample}.kallisto.rds | RDS | SingleCellExperiment object from kallisto|bustools workflow |
| {sample}.salmon.rds | RDS | SingleCellExperiment object from Alevin workflow |
| {sample}.star.rds | RDS | SingleCellExperiment object from STARsolo workflow |
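For downstream scripts it can help to build these file paths programmatically. The Python sketch below assumes the `{sample}.{method}.rds` naming shown in the output directory tree and an output directory called `output`; the helper `sce_paths` is hypothetical and not part of the workflow.

```python
from pathlib import PurePosixPath

# Method labels as they appear in the file names of the singlecellexperiment directory.
METHODS = ("kallisto", "salmon", "star")


def sce_paths(sample, output_dir="output"):
    """Return the expected SingleCellExperiment file paths for one sample."""
    base = PurePosixPath(output_dir) / "singlecellexperiment"
    return [str(base / f"{sample}.{method}.rds") for method in METHODS]


for path in sce_paths("S1"):
    print(path)
# prints:
# output/singlecellexperiment/S1.kallisto.rds
# output/singlecellexperiment/S1.salmon.rds
# output/singlecellexperiment/S1.star.rds
```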
The star directory contains output files generated by the genomeGenerate and alignReads commands from the STAR software:
| File | Format | Description |
|---|---|---|
| align/{sample}/Aligned.sortedByCoord.out.bam | BAM | Spliced alignments |
| align/{sample}/Solo.out/Gene | DIR | Gene directory |
| align/{sample}/Solo.out/Velocyto | DIR | Velocyto directory |
| index/{genome} | DIR | STAR index |
Test cases are in the .test directory. They are automatically executed via continuous integration with GitHub Actions.
I would suggest there is no best workflow; each one captures a unique aspect of the data through the counting strategy it implements. For a more in-depth discussion, please refer to this research article by Soneson and colleagues: https://doi.org/10.1371/journal.pcbi.1008585
The workflow uses genomepy and gffread to download and parse the user-specified reference genome and annotation. Therefore, any genome release compatible with these tools should be supported.
If the sequencing runs were performed across multiple lanes on the same date, it is unlikely that a batch effect is present and I would recommend quantifying the files all together. Below is an example units table showing how to specify multiple sequencing runs jointly for a given sample:
sample,unit,read1,read2
S1,L001,S1_L001_1.fastq.gz,S1_L001_2.fastq.gz
S1,L002,S1_L002_1.fastq.gz,S1_L002_2.fastq.gz
Alternatively, if the sequencing runs were performed on different machines and different dates, there is potential for a batch effect and I would recommend quantifying the files separately until this can be investigated. Below is an example units table showing how to specify multiple sequencing runs independently for a given sample:
sample,unit,read1,read2
S1_L001,L001,S1_L001_1.fastq.gz,S1_L001_2.fastq.gz
S1_L002,L002,S1_L002_1.fastq.gz,S1_L002_2.fastq.gz
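The difference between the two layouts comes down to the value in the sample column. As a small sketch, the hypothetical helper `quantify_units_separately` below turns a joint units table into an independent one by renaming each sample to `<sample>_<unit>`. The file names are borrowed from the units table example earlier on this page.

```python
def quantify_units_separately(rows):
    """Rename each sample to '<sample>_<unit>' so every unit is quantified on its own."""
    return [dict(row, sample=f"{row['sample']}_{row['unit']}") for row in rows]


# Joint layout: both units share the sample name S1.
joint = [
    {"sample": "S1", "unit": "L001", "read1": "S1_L001_1.fastq.gz", "read2": "S1_L001_2.fastq.gz"},
    {"sample": "S1", "unit": "L002", "read1": "S1_L002_1.fastq.gz", "read2": "S1_L002_2.fastq.gz"},
]

print("sample,unit,read1,read2")
for row in quantify_units_separately(joint):
    print(",".join(row[column] for column in ("sample", "unit", "read1", "read2")))
# prints:
# sample,unit,read1,read2
# S1_L001,L001,S1_L001_1.fastq.gz,S1_L001_2.fastq.gz
# S1_L002,L002,S1_L002_1.fastq.gz,S1_L002_2.fastq.gz
```

With this layout the samples table presumably also needs one row per renamed sample (here `S1_L001` and `S1_L002`); that is an assumption, not something this page states.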
Chromium relies on multiple open-source software packages. Please give appropriate credit by citing them in your publication:
Alevin
Srivastava, A., Malik, L., Smith, T. et al. Alevin efficiently estimates accurate gene abundances from dscRNA-seq data. Genome Biol 20, 65 (2019).
Anaconda
Anaconda Software Distribution. Computer software. Vers. 2-2.4.0. Anaconda, Nov. 2016. Web.
BUSpaRse
Moses L, Pachter L (2021). BUSpaRse: kallisto | bustools R utilities. R package version 1.4.2.
BUStools
Melsted, P., Booeshaghi, A.S., Liu, L. et al. Modular, efficient and constant-memory single-cell RNA-seq preprocessing. Nat Biotechnol (2021).
Bioconda
Grüning, B., Dale, R., Sjödin, A. et al. Bioconda: sustainable and comprehensive software distribution for the life sciences. Nat Methods 15, 475–476 (2018).
eisaR
Stadler MB, Gaidatzis D, Burger L, Soneson C (2020). eisaR: Exon-Intron Split Analysis (EISA) in R. R package version 1.0.
genomepy
van Heeringen, S.J. (2017). genomepy: download genomes the easy way. Journal of Open Source Software, 2(16), 320.
GffRead
Pertea G and Pertea M. GFF Utilities: GffRead and GffCompare [version 2; peer review: 3 approved]. F1000Research 2020, 9:304.
Kallisto
Bray, N., Pimentel, H., Melsted, P. et al. Near-optimal probabilistic RNA-seq quantification. Nat Biotechnol 34, 525–527 (2016).
Python
Python Core Team (2015). Python: A dynamic, open source programming language. Python Software Foundation.
R
R Core Team (2020). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria.
STAR
Alexander Dobin, Carrie A. Davis, Felix Schlesinger, Jorg Drenkow, Chris Zaleski, Sonali Jha, Philippe Batut, Mark Chaisson, Thomas R. Gingeras, STAR: ultrafast universal RNA-seq aligner, Bioinformatics, Volume 29, Issue 1, January 2013, Pages 15–21.
SingleCellExperiment
Amezquita, R.A., Lun, A.T.L., Becht, E. et al. Orchestrating single-cell analysis with Bioconductor. Nat Methods 17, 137–145 (2020).
Snakemake
Mölder F, Jablonski KP, Letcher B et al. Sustainable data analysis with Snakemake [version 2; peer review: 2 approved]. F1000Research 2021, 10:33.
