Skip to content

Latest commit

 

History

History
420 lines (313 loc) · 14.7 KB

File metadata and controls

420 lines (313 loc) · 14.7 KB

Documentation

Welcome to the Chromium documentation

Contents

Introduction

Chromium is a Snakemake workflow to process single cell gene expression data from the 10x Genomics platfom.

Usage

The workflow can be executed using the following command:

$ snakemake --cores all --use-conda

This will use all available cores and deploy software dependencies via the conda package manager. For further information, please refer to the official Snakemake documentation.

Update the workflow

Use git to pull the latest release:

$ git pull

Configuration

Chromium is configured by editing the files in the config directory:

  • config.yaml
  • samples.csv
  • units.csv

An error will be thrown if these files are missing or do not contain the required information.

Workflow config

The workflow config is a YAML file containing information about the workflow parameters:

  • each line contains a name:value pair
  • each name:value pair corresponds to a workflow parameter

The workflow config must contain the following:

Name Description Example
samples Path to samples.csv config/samples.csv
units Path to units.csv config/units.csv
source Transcriptome source Ensembl
organism Species name Homo sapiens
release Release number 101
genome Genome assembly GRCh38.p13
chemistry Chemistry version 10xv3

Example of a valid workflow config:

samples: "config/samples.csv"
units: "config/units.csv"
source: "Ensembl"
organism: "Homo sapiens"
release: "101"
genome: "GRCh38.p13"
chemistry: "10xv3"

Samples table

The samples table is a CSV file containing information about the biological samples in your experiment:

  • each row corresponds to one sample
  • each column corresponds to one attribute

For each sample, you must provide the following:

Column Description Example
sample Sample name S1

Example of a valid samples table:

sample,condition
S1,control
S2,treatment

Units table

The units table is a CSV file containing information about the sequencing units in your experiment:

  • each row corresponds to one unit
  • each column corresponds to one attribute

For each unit, you must provide the following:

Column Description Example
sample Sample name S1
unit Unit name L001
read1 Read 1 file S1_L001_1.fastq.gz
read2 Read 2 file S1_L001_2.fastq.gz

Example of valid units table:

sample,unit,read1,read2
S1,L001,S1_L001_1.fastq.gz,S1_L001_2.fastq.gz
S1,L002,S1_L002_1.fastq.gz,S1_L002_2.fastq.gz
S2,L001,S2_L001_1.fastq.gz,S2_L001_2.fastq.gz
S2,L002,S2_L002_1.fastq.gz,S2_L002_2.fastq.gz

Output

Chromium processes the data using three different quantification methods:

  • kallisto|bustools
  • Alevin
  • STARsolo

For each method, all output files are saved to disk and a SingleCellExperiment
object containing spliced and unspliced count matrices is generated.

Output directory

Chromium saves all output files in the output directory. The files are organised
by software and labelled by sample name and data type:

output

├── busparse
│   └── {genome}.cDNA_introns.fa
│   ├── {genome}.cDNA_tx_to_capture.txt
│   ├── {genome}.introns_tx_to_capture.txt
│   └── {genome}.tr2g.tsv

├── bustools
│   ├── {sample}.correct.bus
│   ├── {sample}.sort.bus
│   ├── {sample}.spliced.bus
│   ├── {sample}.unspliced.bus
│   ├── {sample}.spliced.mtx
│   ├── {sample}.spliced.barcodes.txt
│   ├── {sample}.spliced.genes.txt
│   ├── {sample}.unspliced.mtx
│   ├── {sample}.unspliced.barcodes.txt
│   └── {sample}.unspliced.genes.txt

├── eisar
│   ├── {genome}.annotation.gtf
│   ├── {genome}.fa
│   ├── {genome}.features.tsv
│   ├── {genome}.ranges.rds
│   └── {genome}.tx2gene.tsv

├── genomepy
│   └── index
│       └── {genome}.idx

│   ├── {genome}.annotation.bed.gz
│   ├── {genome}.annotation.gtf.gz
│   ├── {genome}.fa
│   ├── {genome}.fa.fai
│   ├── {genome}.fa.sizes
│   └── {genome}.gaps.bed

├── gffread
│   ├── {genome}.id2name.tsv
│   ├── {genome}.mrna.txt
│   ├── {genome}.rrna.txt
│   └── {genome}.tx2gene.tsv

├── kallisto
│   ├── bus
│   │   └── {sample}
│   └── index
│       └── {genome}.idx

├── salmon
│   ├── alevin
│   │   └── {sample}
│   ├── genome
│   │   └── {genome}
│   └── index
│       └── {genome}

├── singlecellexperiment
│   ├── {sample}.kallisto.rds
│   ├── {sample}.salmon.rds
│   └── {sample}.star.rds

└── star
    ├── align
    │   └── {sample}
    └── index
        └── {genome}

Output files

For each rule

BUSpaRse

The busparse directory contains output files generated by the get_velocity_files and read_velocity_files functions from the BUSpaRse software:

File Format Description
{genome}.cDNA_introns.fa FASTA Spliced transcript and intron sequences
{genome}.cDNA_tx_to_capture.txt TXT Transcript IDs of spliced transcripts
{genome}.introns_tx_to_capture.txt TXT Transcript IDs of introns
{genome}.tr2g.tsv TSV Transcript to gene ID table

BUStools

The bustools directory contains output files generated by the correct, sort, capture, and count commands of the BUStools software:

File Format Description
{sample}.correct.bus BUS Corrected BUS file
{sample}.sort.bus BUS Sorted BUS file
{sample}.spliced.bus BUS Spliced BUS file
{sample}.unspliced.bus BUS Unspliced BUS file
{sample}.spliced.barcodes.txt TXT Spliced barcodes
{sample}.spliced.genes.txt TXT Spliced genes
{sample}.spliced.mtx MTX Spliced counts
{sample}.unspliced.barcodes.txt TXT Unspliced barcodes
{sample}.unspliced.genes.txt TXT Unspliced genes
{sample}.unspliced.mtx MTX Unspliced counts

eisaR

The eisar directory contains output files generated by the exportToGtf,
getFeatureRanges, and getTx2Gene functions from the eisaR software:

File Format Description
{genome}.annotation.gtf GTF Expanded gene annotation
{genome}.fa FASTA Spliced transcript and intron sequences
{genome}.features.tsv TSV Spliced transcript and intron names
{genome}.ranges.rds RDS Spliced transcript and intron ranges
{genome}.tx2gene.tsv TSV Transcript to gene table

genomepy

The genomepy directory contains output files generated by the install
command from the genomepy software.

File Format Description
{genome}.annotation.bed.gz BED Gene annotation
{genome}.annotation.gtf.gz GTF Gene annotation
{genome}.fa FASTA Genome sequence
{genome}.fa.sizes TSV Chromosome size
{genome}.gaps.bed BED Gap location
README.txt TXT README

GffRead

The gffread directory contains output files generated by the GffRead software:

File Format Description
{genome}.id2name.tsv TSV The gene_id to gene_name annotation table
{genome}.mrna.txt TXT The mRNA gene_id annotation table
{genome}.rrna.txt TXT The rRNA gene_id annotation table
{genome}.tx2gene.tsv TSV The transcript_id to gene_id annotation table

Kallisto

The kallisto directory contains output files generated by the index and bus commands of the Kallisto software:

File Format Description
bus/{sample}/matrix.ec MTX Equivalence class
bus/{sample}/output.bus BUS Output
bus/{sample}/run_info.json JSON Run information
bus/{sample}/transcripts.txt TXT Transcript names
index/{genome}.idx IDX Kallisto index

Salmon

The salmon directory contains output files generated by the index and alevin commands from the Salmon software:

File Format Description
alevin/{sample}/quants_mat.gz TSV Compressed count matrix
alevin/{sample}/quants_mat_cols.txt TXT Column header (gene_id) of the matrix
alevin/{sample}/quants_mat_rows.txt TXT Row index (CB-ids) of the matrix
alevin/{sample}/quants_tier_mat.gz TSV Tier categorization of the matrix
index/{genome} DIR Salmon index

SingleCellExperiment

The singlecellexperiment directory contains SingleCellExperiment object files containing spliced and unspliced count matrices:

File Format Description
kallisto.rds RDS SingleCellExperiment object from kallisto|bustools workflow
salmon.rds RDS SingleCellExperiment object from Alevin workflow
star.rds RDS SingleCellExperiment object from STARsolo workflow

STAR

The star directory contains output files generated by the genomeGenerate and alignReads commands from the STAR software:

File Format Description
align/{sample}/Aligned.sortedByCoord.out.bam BAM Spliced alignments
align/{sample}/Solo.out/Gene DIR Gene directory
align/{sample}/Solo.out/Velocyto DIR Velocyto directory
index/{genome} DIR STAR index

Tests

Test cases are in the .test directory. They are automatically executed via continuous integration with GitHub Actions.

FAQ

Which quantification workflow should I use?

I would suggest there is no best workflow, each one captures a unique aspect of the data by the counting strategies they have implemented. For a more in-depth discussion, please refer to this research article by Soneson and colleagues: https://doi.org/10.1371/journal.pcbi.1008585

What reference genomes are supported?

The workflow uses genomepy and gffread to download and parse the user-specified reference genome and annotation. Therefore, any genome release compatible with these software should be supported.

How do I combine multiple sequencing runs?

If the sequencing runs were performed across multiple lanes on the same date, it is unlikely that a batch effect is present and I would recommend quantifying the files all together. Below is an example units table showing how to specify multiple sequencing runs jointly for a given sample:

sample,unit,read1,read2
S1,L001,S1_L001.fastq.gz,S1_L001.fastq.gz
S1,L002,S1_L002.fastq.gz,S1_L002.fastq.gz

Alternatively, if the sequencing runs were performed on different machines and different dates, there is potential for a batch effect and I would recommend quantifying the files separately until this can be investigated. Below is an example units table showing how to specify multiple sequencing runs independently for a given sample:

sample,unit,read1,read2
S1_L001,L001,S1_L001.fastq.gz,S1_L001.fastq.gz
S1_L002,L002,S1_L002.fastq.gz,S1_L002.fastq.gz

References

Chromium relies on multiple open-source software. Please give appropriate credit by citing them in your publication:

Alevin
Srivastava, A., Malik, L., Smith, T. et al. Alevin efficiently estimates accurate gene abundances from dscRNA-seq data. Genome Biol 20, 65 (2019).

Anaconda
Anaconda Software Distribution. Computer software. Vers. 2-2.4.0. Anaconda, Nov. 2016. Web.

BUSpaRse
Moses L, Pachter L (2021). BUSpaRse: kallisto | bustools R utilities. R package version 1.4.2.

BUStools
Melsted, P., Booeshaghi, A.S., Liu, L. et al. Modular, efficient and constant-memory single-cell RNA-seq preprocessing. Nat Biotechnol (2021).

Bioconda
Grüning, B., Dale, R., Sjödin, A. et al. Bioconda: sustainable and comprehensive software distribution for the life sciences. Nat Methods 15, 475–476 (2018).

eisaR
Stadler MB, Gaidatzis D, Burger L, Soneson C (2020). eisaR: Exon-Intron Split Anaalysis (EISA) in R. R package version 1.0.

genomepy
Heeringen, (2017), genomepy: download genomes the easy way, Journal of Open Source Software, 2(16), 320.

GffRead
Pertea G and Pertea M. GFF Utilities: GffRead and GffCompare [version 2; peer review: 3 approved]. F1000Research 2020, 9:304.

Kallisto
Bray, N., Pimentel, H., Melsted, P. et al. Near-optimal probabilistic RNA-seq quantification. Nat Biotechnol 34, 525–527 (2016).

Python
Python Core Team (2015). Python: A dynamic, open source programming language. Python Software Foundation.

R
R Core Team (2020). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria.

STAR
Alexander Dobin, Carrie A. Davis, Felix Schlesinger, Jorg Drenkow, Chris Zaleski, Sonali Jha, Philippe Batut, Mark Chaisson, Thomas R. Gingeras, STAR: ultrafast universal RNA-seq aligner, Bioinformatics, Volume 29, Issue 1, January 2013, Pages 15–21.

SingleCellExperiment
Amezquita, R.A., Lun, A.T.L., Becht, E. et al. Orchestrating single-cell analysis with Bioconductor. Nat Methods 17, 137–145 (2020).

Snakemake
Mölder F, Jablonski KP, Letcher B et al. Sustainable data analysis with Snakemake [version 2; peer review: 2 approved]. F1000Research 2021, 10:33.