Skip to content

cancerevo/AmpliconPE

Repository files navigation

Code style: black Python application

AmpliconPE

Extracts DNA barcodes from Paired-End (PE) amplicon sequencing using a Smith Waterman alignment. Useful for lineage tracing, multiplexed-sgRNA pools, and other ultra-deep sequencing projects.

Why use AmpliconPE for barcode extraction?

From Johnson et al 2022:

Extracting barcodes from the sequencing reads may appear as a trivial problem at first glance, given that the structure of the read is known by design. However, the challenge is that not all reads may have identical structure...

  • PE (redundant) seqencing is as cheap as Single-End sequencing(!) and eliminates 99% of errors
  • Both Homology to the reference read and agreement between the forward- and reverse-reads are informative (most people only threshold on one attribute)
  • Homology-based alignment to a reference read is the most robust way to correct read anomalies
  • Its SIMD-fast & highly extensible

Why not use a RegEx expression?

Fuzzy RegEx libraries use sequence aligners, so they potentially could mimic some of our functionality; however, they aren't designed for DNA sequences and lack important functionality in this package.

Usage

We provide command-line scripts to process common amplicon constructs (e.g. TuBa-seq, CLONtracer, Brunello/Brie CRISPRko libraries).

Usage in each is built around a MasterRead class, which aligns PE reads to a reference sequence:

from AmpliconPE import MasterRead
tuba_seq_reference_seq = 'GACCCGGA'            +    # 5' flanking sequence of double-barcode (8 nts is good)
                         'NNNNNNNN'            +    # Barcode of known possibilities
                         'AA'                  +    # spacer
                         'NNNNNTTNNNNNAANNNNN' +    # Random Barcode w/ spacers - specified by 'N'
                         'ATGCCCAA'                 # 3' flank
                       
master_read = MasterRead(tuba_seq_reference_seq)

MasterRead automatically extracts barcode start and end locations...

print(master_read.barcode_start, master_read.barcode_stop)
> (8, 37)

Reads are then processed using pairedFASTQiter:

from AmpliconPE import pairedFASTQiter

Iter = pairedFASTQiter('forward_file.fastq.gz', 'reverse_file.fastq.gz', check_indecies=True)

fwd_read, rev_read = next(Iter)

Index-Hopping dramatically undermines the power of barcode sequencing. For this reason, we recommend (1) sequence only primer-free libraries, (2) use Dual-Unique Indecies, and (3) filter non-matching Forward/Reverse index pairs using the check_indecies keyword argument.

Barcodes can be generated by random oligosynthesis (random barcodes) or synthesized from a pre-generated list (known barcodes). Our BarcodeSet class extracts known barcodes in a mismatch-tolerant manner. Use BARCOSEL to generate barcode sets that are robust to sequencing errors.

import pandas as pd
target_genes = pd.read_csv('sgRNA_info.csv').set_index("Targeted Gene")['barcode']

from AmpliconPE import BarcodeSet
known_barcodes = BarcodeSet(target_genes, n_mismatches=1)

Barcodes are then extracted using an internal-loop that generally looks something like this:

  score = master_read.score(fwd_read, rev_read)
  if score < 0.8 * master_read.max_score:
    continue  # poor alignment
  
  barcode = master_read.extract_barcode(fwd_read, rev_read)
  if barcode == 'Length Mismatch':
    continue
  
  known_barcode = barcode[:8]
  sgRNA_target = known_barcodes.get(known_barcode, 'Unknown Target')
  
  random_barcode = barcode[8:]
  if 'N' in random_barcode:
    continue

In short, MasterRead internally-stores the reference alignment to both the forward and reverse read when you call the align method--- alignment is the performance-limiting step and implemented using SSW Library. Single nucleotide differences between the forward and reverse barcodes are replaced with 'N', while InDel differences between barcodes returns 'Length Mismatch'.

Generally, reads are kept if they match the reference read well and if forward and reverse barcodes match. Barcode pileups are then tallied and processed using downstream software. You may want to use a barcode clusterer, e.g. Shepard, to de-noise random barcodes; however, PE sequencing generally resolves most reccurrent-read errors. The function AmpliconPE.identify_neighbors cleans-up simple recurrent read errors.

About

Extracts DNA barcodes from Paired-End Amplicon Sequencing

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors