The scripts in this repo were used to simulate DNA barcode sequencing results, including both UMI (unique molecular identifier) sequencing and non-UMI sequencing techniques.
The ART sequencing simulator is included in this repo, but you can also install it by yourself. The building ART simulator can work on 64-bit Linux, MacOS and Windows.
In case to download the ART NGS simulator, please check out following link:
https://www.niehs.nih.gov/research/resources/software/biostatistics/art/index.cfm
To use the customed ART simulator, please make the art_bin variable point to the art_illumina file.
Otherwise, please keep the art_bin variable as default, which is NULL, to use the built-in ART simulator.
The following R packages are required for running the simulation:
Script for installing the simulator:
## install remotes package if you do not have it
install.packages("remotes")
## install CellBarcodeSim from github
remotes::install_github("TeamPerie/CellBarcodeSim")The example1_simulate_sequencing.R script in the viggnettes fold will give you a demo of barcode sequencing simulation, you can find in the root dir.
Normally, it takes less than 1 min to run it.
Though we provided some barcode library in viggnettes/ folder, if you need to generate more barcode libraries with customed parameters you can explore the example in example2_simulate_barcode_library.R.
The barcode library file is needed for the sequencing simulation function described below.
The file should be a tab-separated file (TSV) and consist of two columns,
seqcolumn: barcode sequence.freqcolumn: barcode frequency. It is used as barcode probability while sampling the barcode library to label the cells.
Three example barcode libraries are in the inst/data fold. They are:
- Random barcode:
random_barcodes.tsv` contains a 14bp random barcode library consisting of 1e6 simulated sequences. Exists some duplicated sequences. - Hamming distance barcode:
hamming_barcodes.tsvcontains 9155 14bp sequences that have been identified with a minimum Hamming distance of 3 between each sequence. - Vdj barcode:
vdj_barcodes.tsvcontains$10^7$ simulated VDJ barcodes that contain around$1.4\times10^5$ unique barcodes.
Once the barcode library is loaded, you can get the barcode library by using:
barcode_library = system.file("data", "random_barcodes.tsv", package = "CellBarcodeSim")The simulation methods were described in the paper (TBD: add paper link).
There are also 8 functions to simulate barcode libraries.
They simulated 2 types of barcodes library X 4 barcode library distribution due to potential barcode amplification bias.
The 2 types of barcodes library are:
- Random barcode: The barcode sequences are random.
- Hamming distance barcode: The barcode sequences have a minimum Hamming distance between each other.
The 4 barcode library distribution are:
- Uniform distribution.
- Normal distribution.
- Lognormal distribution.
- Exponential distribution.
The four functions used to simulate the hamming distance barcode library are:
simu_barcode_hamming_uniform(): Simulate the hamming distance barcode library with uniform distribution.simu_barcode_hamming_norm(): Simulate the hamming distance barcode library with normal distribution.simu_barcode_hamming_lnorm(): Simulate the hamming distance barcode library with lognormal distribution.simu_barcode_hamming_exp(): Simulate the hamming distance barcode library with exponential distribution.
Those wrap the barcode simulator in the DNAbarcodes package to simulate the barcode library given hamming distance. It accepts parameters:
length: barcode length.dist: hamming distance.output: output file path.mean: mean of the distribution, only forsimu_barcode_hamming_norm().sd: standard deviation of the distribution, only forsimu_barcode_hamming_norm().log_mean: log mean of the distribution, only forsimu_barcode_hamming_lnorm().log_sd: log standard deviation of the distribution, only forsimu_barcode_hamming_lnorm().rate: rate of the distribution, only forsimu_barcode_hamming_exp().top_fix: 5 end fix flanking sequence.bottom_fix: 3 end fix flanking sequence.
The output file dir is given, the output file is a two columns TSV file which is described in the previous section.
Otherwise, the function will return a data.table with seq and freq column.
For more complex cases please simulate directly with the DNABarcodes package.
The four functions used to simulate the random barcode library are:
simu_barcode_random_uniform(): Simulate the random barcode library with uniform distribution.simu_barcode_random_norm(): Simulate the random barcode library with normal distribution.simu_barcode_random_lnorm(): Simulate the random barcode library with lognormal distribution.simu_barcode_random_exp(): Simulate the random barcode library with exponential distribution.
They accept parameters:
length: barcode length.n: number of sequences to be simulated. The unique barcodes can be less than the total sequence number due to the potential duplications.output: output file path.mean: mean of the distribution, only forsimu_barcode_random_norm().sd: standard deviation of the distribution, only forsimu_barcode_random_norm().log_mean: log mean of the distribution, only forsimu_barcode_random_lnorm().log_sd: log standard deviation of the distribution, only forsimu_barcode_random_lnorm().rate: rate of the distribution, only forsimu_barcode_random_exp().
We can run the simulate_main() function to do the simulation without UMI.
You can find an example in example.R file.
Following are the parameters and the default:
simulate_main(
barcode_library_file = NULL,
barcode_library = NULL,
clone_size_dist = "uniform",
clone_n = 300,
clone_size_dist_par = list(size_max = 1000, size_min = 1),
cycle = 30,
efficiency = 0.705,
error = 1e-6,
pcr_read_per_cell = 50,
output_prefix = "./tmp/simu_seq",
ngs_profile = "MSv1",
reads_length = 10,
is_replicate = F,s
top_seq = "",
bottom_seq = "",
sequence_trunk = 10,
art_bin = NULL
)The parameters:
-
barcode_library_file: The location of the barcode frequency list. This parameter is necessary and has no default value. We sample the list to have the barcode for labeling. Please refer to the sections above to know how to prepare it. -
barcode_library: If thebarcode_library_fileis empty, adata.frameof barcode library is used here. Thedata.frameshould contain two columnsseqandfreq. -
clone_size_dist: clone size distribution, it can beuniformorlognormal. Theclone_size_dist_parshould match the chosen model (see more detail below). -
clone_n: Number of labeled cells (progenitors) at initiation, default 300. -
clone_size_dist_par: a list. Ifuniformclone size distribution is chosen, the list should contain two itemssize_maxandsize_minfor the range of the uniform distribution. Iflognormalis chosen, the list should containsize_meanwhich is log mean, andsize_variantwhich is log sd. Ifpowerlawclone size distribution is chosen, the list should contain three itemsconstant,scaleandalpha. They are corresponding to the parameters in following equation:$$f(scale \cdot x) = constant (scale \cdot x)^{-\alpha}$$ Where$x \in [0,1]$ . -
cycle: PCR cycle number. The run time and output file size increase will increase exponentially with thecyclevalue, because of the exponential nature of PCR amplification. -
efficiency: PCR efficiency with a default value of 0.705 for each cycle. -
error: PCR error per base with a default value of$10^{-6}$ . -
pcr_read_per_cell: Expacted sequencing reads per each progeny cell. -
output_prefix: Output prefix. -
ngs_profile: NGS profile, this option will be transmitted to ART sequencing simulator. The building profile is, "MSv1", "HS20" etc. For more detail, please check the ART simulator. -
reads_length: The NGS sequencing reads length. It should be shorter than the PCR results. You can usetop_seqorbottom_seq(described below) to add the constant sequences to increase the PCR sequence length. -
is_replicate: Generate technical replicates or not. The technical replicates will generate by dividing the progeny cells into two samples and do the PCR for each. The sequencing data will be labeled by pending_2to label the replicate. -
top_seq: Add constant sequences to the left (5') of the PCR result. -
bottom_seq: Add constant sequences to the right (3') of the PCR result. -
sequence_trunk: Substring the barcode sequence by choosing the first n base, default 10. If the value is bigger than the barcode length, the barcode sequence will not be truncated. -
art_bin: The executable ART simulator location. The default isNULL, which makes use of the building ART simulator.
We can simulate the UMI sequencing result with the function simulate_main_umi().
simulate_main_umi(
barcode_library_file = NULL,
barcode_library = NULL,
clone_size_dist = "uniform",
clone_n = 300,
clone_size_dist_par = list(size_max = 1000, size_min = 1),
cycle = 40,
efficiency = 0.705,
error = 1e-6,
pcr_read_per_umi = 50,
output_prefix = "./tmp/simu_umi_seq",
ngs_profile = "MSv1",
reads_length = 100,
top_seq = "",
bottom_seq = "",
sequence_trunk = 10,
preamp_n = 0,
umi_length = 8,
umi_tagging_efficiency = 0.25,
art_bin = NULL
)For most of the parameters, you can find explanations in the documents of simu_main.
The simulate_main_umi() specific parameters:
preamp_n: PCR cycle before catenating UMI, default is 0.umi_length: The UMI base pair length, default is 8.umi_tagging_efficiency: The UMI tagging efficiency, default is 0.25. It means 25% of sequences in the pre-PCR pool will be tagged by UMI.
The output location and file name are defined by the output_prefix option.
For example, by running the example.R, the option is ./tmp/simu_seq. The output files are in ./tmp/ fold:
simu_seq_ref.tsv: The induced barcodes.simu_seq_library.fasta: The sequencing library, which will only carry the PCR error.simu_seq.aln: The intermediate files from ART.simu_seq.fq: Fianl barcode sequencing fastq file.