Extracts DNA barcodes from Paired-End (PE) amplicon sequencing using a Smith Waterman alignment. Useful for lineage tracing, multiplexed-sgRNA pools, and other ultra-deep sequencing projects.
From Johnson et al 2022:
Extracting barcodes from the sequencing reads may appear as a trivial problem at first glance, given that the structure of the read is known by design. However, the challenge is that not all reads may have identical structure...
- PE (redundant) seqencing is as cheap as Single-End sequencing(!) and eliminates 99% of errors
- Both Homology to the reference read and agreement between the forward- and reverse-reads are informative (most people only threshold on one attribute)
- Homology-based alignment to a reference read is the most robust way to correct read anomalies
- Its SIMD-fast & highly extensible
Fuzzy RegEx libraries use sequence aligners, so they potentially could mimic some of our functionality; however, they aren't designed for DNA sequences and lack important functionality in this package.
We provide command-line scripts to process common amplicon constructs (e.g. TuBa-seq, CLONtracer, Brunello/Brie CRISPRko libraries).
Usage in each is built around a MasterRead class, which aligns PE reads to a reference sequence:
from AmpliconPE import MasterRead
tuba_seq_reference_seq = 'GACCCGGA' + # 5' flanking sequence of double-barcode (8 nts is good)
'NNNNNNNN' + # Barcode of known possibilities
'AA' + # spacer
'NNNNNTTNNNNNAANNNNN' + # Random Barcode w/ spacers - specified by 'N'
'ATGCCCAA' # 3' flank
master_read = MasterRead(tuba_seq_reference_seq)MasterRead automatically extracts barcode start and end locations...
print(master_read.barcode_start, master_read.barcode_stop)
> (8, 37)Reads are then processed using pairedFASTQiter:
from AmpliconPE import pairedFASTQiter
Iter = pairedFASTQiter('forward_file.fastq.gz', 'reverse_file.fastq.gz', check_indecies=True)
fwd_read, rev_read = next(Iter)Index-Hopping dramatically undermines the power of barcode sequencing. For this reason, we recommend (1) sequence only primer-free libraries, (2) use Dual-Unique Indecies, and (3) filter non-matching Forward/Reverse index pairs using the check_indecies keyword argument.
Barcodes can be generated by random oligosynthesis (random barcodes) or synthesized from a pre-generated list (known barcodes). Our BarcodeSet class
extracts known barcodes in a mismatch-tolerant manner. Use BARCOSEL to generate barcode sets that are robust to sequencing errors.
import pandas as pd
target_genes = pd.read_csv('sgRNA_info.csv').set_index("Targeted Gene")['barcode']
from AmpliconPE import BarcodeSet
known_barcodes = BarcodeSet(target_genes, n_mismatches=1)Barcodes are then extracted using an internal-loop that generally looks something like this:
score = master_read.score(fwd_read, rev_read)
if score < 0.8 * master_read.max_score:
continue # poor alignment
barcode = master_read.extract_barcode(fwd_read, rev_read)
if barcode == 'Length Mismatch':
continue
known_barcode = barcode[:8]
sgRNA_target = known_barcodes.get(known_barcode, 'Unknown Target')
random_barcode = barcode[8:]
if 'N' in random_barcode:
continueIn short, MasterRead internally-stores the reference alignment to both the forward and reverse read when you call the align method--- alignment
is the performance-limiting step and implemented using SSW Library. Single nucleotide differences between the forward and reverse barcodes
are replaced with 'N', while InDel differences between barcodes returns 'Length Mismatch'.
Generally, reads are kept if they match the reference read well and if forward and reverse barcodes match. Barcode pileups are then tallied and
processed using downstream software. You may want to use a barcode clusterer, e.g. Shepard, to de-noise random barcodes; however, PE sequencing
generally resolves most reccurrent-read errors. The function AmpliconPE.identify_neighbors cleans-up simple recurrent read errors.