AmpliconPE

Extracts DNA barcodes from Paired-End (PE) amplicon sequencing reads using Smith-Waterman alignment. Useful for lineage tracing, multiplexed sgRNA pools, and other ultra-deep sequencing projects.

Why use AmpliconPE for barcode extraction?

From Johnson et al 2022:

Extracting barcodes from the sequencing reads may appear as a trivial problem at first glance, given that the structure of the read is known by design. However, the challenge is that not all reads may have identical structure...

PE (redundant) sequencing is as cheap as Single-End sequencing(!) and eliminates 99% of errors
Both homology to the reference read and agreement between the forward and reverse reads are informative (most people only threshold on one attribute)
Homology-based alignment to a reference read is the most robust way to correct read anomalies
It's SIMD-fast & highly extensible
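
To illustrate why homology-based alignment tolerates read anomalies, here is a minimal pure-Python sketch of Smith-Waterman local-alignment scoring. It is not the package's implementation (AmpliconPE relies on the SIMD-accelerated SSW Library); the function name, the scoring parameters, and the treatment of 'N' as a wildcard are assumptions made for illustration only.

```python
def smith_waterman_score(reference, read, match=2, mismatch=-2, gap=-3):
    """Best local-alignment score of `read` against `reference` (linear gap penalty)."""
    previous = [0] * (len(read) + 1)
    best = 0
    for ref_base in reference:
        current = [0]
        for j, read_base in enumerate(read, start=1):
            # Assumption: 'N' in the reference is a wildcard that matches any base.
            pair = match if ref_base in ('N', read_base) else mismatch
            cell = max(0,
                       previous[j - 1] + pair,  # align the two bases
                       previous[j] + gap,       # reference base with no read base (deletion)
                       current[j - 1] + gap)    # read base with no reference base (insertion)
            current.append(cell)
            best = max(best, cell)
        previous = current
    return best


# Shortened, hypothetical construct: 5' flank + 8-nt barcode + 3' flank
reference = 'GACCCGGA' + 'NNNNNNNN' + 'ATGCCCAA'
perfect_read = 'GACCCGGA' + 'ACGTACGT' + 'ATGCCCAA'
deletion_read = 'GACCCGGA' + 'ACGTACG' + 'ATGCCCAA'  # barcode lost one base

max_score = smith_waterman_score(reference, perfect_read)
deletion_score = smith_waterman_score(reference, deletion_read)
print(max_score, deletion_score)
```

With these parameters the read carrying a one-base deletion still scores well above 80% of the maximum, because both flanks align around a single gap. A fixed-position pattern would lose the 3' flank entirely.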

Why not use a RegEx expression?

Fuzzy RegEx libraries use sequence aligners, so they could potentially mimic some of our functionality; however, they aren't designed for DNA sequences and lack important functionality that this package provides.

Usage

We provide command-line scripts to process common amplicon constructs (e.g. TuBa-seq, CLONtracer, Brunello/Brie CRISPRko libraries).

Usage in each is built around a MasterRead class, which aligns PE reads to a reference sequence:

from AmpliconPE import MasterRead

# Parentheses allow the concatenation (and its comments) to span several lines
tuba_seq_reference_seq = (
    'GACCCGGA'            +    # 5' flanking sequence of double-barcode (8 nts is good)
    'NNNNNNNN'            +    # Barcode of known possibilities
    'AA'                  +    # spacer
    'NNNNNTTNNNNNAANNNNN' +    # Random Barcode w/ spacers - specified by 'N'
    'ATGCCCAA'                 # 3' flank
)

master_read = MasterRead(tuba_seq_reference_seq)

MasterRead automatically extracts barcode start and end locations...

print(master_read.barcode_start, master_read.barcode_stop)
> 8 37
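
The start and stop positions above are consistent with taking the first and last 'N' in the reference sequence. The sketch below reproduces that arithmetic; `barcode_bounds` is a hypothetical helper and an assumption about how the bounds could be derived, not MasterRead's actual code.

```python
def barcode_bounds(reference_seq):
    """Return (start, stop) of the span from the first to the last 'N' (stop is exclusive)."""
    start = reference_seq.find('N')
    if start == -1:
        raise ValueError("reference sequence contains no 'N' barcode positions")
    stop = reference_seq.rfind('N') + 1
    return start, stop


reference_seq = (
    'GACCCGGA'             # 5' flank
    'NNNNNNNN'             # known barcode
    'AA'                   # spacer
    'NNNNNTTNNNNNAANNNNN'  # random barcode with spacers
    'ATGCCCAA'             # 3' flank
)
start, stop = barcode_bounds(reference_seq)
print(start, stop)  # 8 37
```

Under this assumption the extracted barcode spans 29 nucleotides and includes the internal spacers between the known and random barcodes.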

Reads are then processed using pairedFASTQiter:

from AmpliconPE import pairedFASTQiter

Iter = pairedFASTQiter('forward_file.fastq.gz', 'reverse_file.fastq.gz', check_indecies=True)

fwd_read, rev_read = next(Iter)

Index-Hopping dramatically undermines the power of barcode sequencing. For this reason, we recommend that you (1) sequence only primer-free libraries, (2) use Dual-Unique Indices, and (3) filter non-matching Forward/Reverse index pairs using the check_indecies keyword argument.
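
As a rough illustration of index-pair filtering, the sketch below parses the index field from an Illumina-style FASTQ header and checks it against the expected Dual-Unique pairs. It is an assumption about what such a filter can look like, not a description of what check_indecies does internally; the function names, the example headers, and the index sequences are all hypothetical.

```python
def parse_index_pair(header):
    """Split the trailing 'i7+i5' field of an Illumina-style FASTQ header."""
    index_field = header.strip().rsplit(':', 1)[-1]
    i7, _, i5 = index_field.partition('+')
    return i7, i5


def is_expected_pair(header, expected_pairs):
    """True if the header's (i7, i5) combination is one of the expected pairs."""
    return parse_index_pair(header) in expected_pairs


# Hypothetical Dual-Unique Index pairs used to build the library
expected_pairs = {('ATCACGTT', 'AGGCTATA'), ('CGATGTTT', 'GCCTCTAT')}

good_header = '@M00123:45:000000000-ABCDE:1:1101:15589:1332 1:N:0:ATCACGTT+AGGCTATA'
hopped_header = '@M00123:45:000000000-ABCDE:1:1101:15590:1340 1:N:0:ATCACGTT+GCCTCTAT'

print(is_expected_pair(good_header, expected_pairs))    # True
print(is_expected_pair(hopped_header, expected_pairs))  # False
```

The second header combines an i7 from one sample with an i5 from another, which is the signature of a hopped read when every sample has its own unique pair.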

Barcodes can be generated by random oligosynthesis (random barcodes) or synthesized from a pre-generated list (known barcodes). Our BarcodeSet class extracts known barcodes in a mismatch-tolerant manner. Use BARCOSEL to generate barcode sets that are robust to sequencing errors.

import pandas as pd
target_genes = pd.read_csv('sgRNA_info.csv').set_index("Targeted Gene")['barcode']

from AmpliconPE import BarcodeSet
known_barcodes = BarcodeSet(target_genes, n_mismatches=1)
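
One common way to make known-barcode lookup mismatch-tolerant is to precompute every single-mismatch neighbor of each barcode and store it in a dictionary. The sketch below shows that idea for n_mismatches=1. It is not BarcodeSet's implementation; `build_mismatch_lookup`, the barcode-to-target mapping, and the gene names are hypothetical.

```python
def one_mismatch_neighbors(barcode, alphabet='ACGTN'):
    """Yield every sequence that differs from `barcode` at exactly one position."""
    for position, original in enumerate(barcode):
        for base in alphabet:
            if base != original:
                yield barcode[:position] + base + barcode[position + 1:]


def build_mismatch_lookup(barcode_to_target):
    """Map each barcode and its unambiguous 1-mismatch neighbors to a target."""
    lookup = dict(barcode_to_target)
    ambiguous = set()
    for barcode, target in barcode_to_target.items():
        for neighbor in one_mismatch_neighbors(barcode):
            if neighbor in barcode_to_target:
                continue  # exact barcodes always keep their own target
            if neighbor in lookup and lookup[neighbor] != target:
                ambiguous.add(neighbor)  # equally close to two different targets
            else:
                lookup[neighbor] = target
    for neighbor in ambiguous:
        del lookup[neighbor]
    return lookup


# Hypothetical barcode -> target mapping
lookup = build_mismatch_lookup({'AAAAAAAA': 'geneA', 'CCCCCCCC': 'geneB'})
print(lookup.get('AAAAAAAT', 'Unknown Target'))  # geneA (one mismatch)
print(lookup.get('AAAAAATT', 'Unknown Target'))  # Unknown Target (two mismatches)
```

Neighbors that sit one mismatch away from two different targets are dropped rather than guessed, which is why barcode sets designed with a large minimum distance (e.g. via BARCOSEL) work best.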

Barcodes are then extracted in an inner loop that generally looks something like this:

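for fwd_read, rev_read in Iter:
  # Discard read pairs that align poorly to the reference
  score = master_read.score(fwd_read, rev_read)
  if score < 0.8 * master_read.max_score:
    continue  # poor alignment

  # Discard pairs whose forward and reverse barcodes differ in length (InDel)
  barcode = master_read.extract_barcode(fwd_read, rev_read)
  if barcode == 'Length Mismatch':
    continue

  # The first 8 nts are the known barcode; map it to its sgRNA target
  known_barcode = barcode[:8]
  sgRNA_target = known_barcodes.get(known_barcode, 'Unknown Target')

  # The remainder is the random barcode; 'N' marks forward/reverse disagreement
  random_barcode = barcode[8:]
  if 'N' in random_barcode:
    continue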

In short, MasterRead internally stores the reference alignment to both the forward and reverse read when you call the align method. Alignment is the performance-limiting step and is implemented using the SSW Library. Single-nucleotide differences between the forward and reverse barcodes are replaced with 'N', while InDel differences between the barcodes return 'Length Mismatch'.
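
The forward/reverse consensus rule described above can be sketched in a few lines. This is an illustration of the rule only, not the package's code: `merge_barcodes` and `reverse_complement` are hypothetical helpers, and the sketch assumes both barcodes have already been cut out of their aligned reads.

```python
COMPLEMENT = str.maketrans('ACGTN', 'TGCAN')


def reverse_complement(sequence):
    """Return the reverse complement of a DNA sequence."""
    return sequence.translate(COMPLEMENT)[::-1]


def merge_barcodes(fwd_barcode, rev_barcode):
    """Combine forward and reverse barcodes (both in forward orientation)."""
    if len(fwd_barcode) != len(rev_barcode):
        return 'Length Mismatch'  # an InDel in one of the two reads
    # Keep agreeing bases; mask disagreements with 'N'
    return ''.join(f if f == r else 'N' for f, r in zip(fwd_barcode, rev_barcode))


fwd = 'ACGTTGCA'
rev = reverse_complement('TGCATCGT')  # reverse read, re-oriented: 'ACGATGCA'

print(merge_barcodes(fwd, fwd))        # ACGTTGCA
print(merge_barcodes(fwd, rev))        # ACGNTGCA
print(merge_barcodes(fwd, 'ACGTTGC'))  # Length Mismatch
```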

Generally, reads are kept if they match the reference read well and if the forward and reverse barcodes match. Barcode pileups are then tallied and processed using downstream software. You may want to use a barcode clusterer, e.g. Shepard, to de-noise random barcodes; however, PE sequencing generally resolves most recurrent read errors. The function AmpliconPE.identify_neighbors cleans up simple recurrent read errors.
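
A minimal sketch of tallying pileups and folding likely read errors into their parent barcodes follows. It uses a simple heuristic (merge a barcode into a single-mismatch neighbor that is at least `min_ratio` times more abundant). This heuristic, the `fold_neighbors` name, and the `min_ratio` parameter are assumptions for illustration and do not describe how AmpliconPE.identify_neighbors works.

```python
from collections import Counter


def hamming_distance(a, b):
    """Number of differing positions between two equal-length sequences."""
    return sum(x != y for x, y in zip(a, b))


def fold_neighbors(pileups, min_ratio=10):
    """Merge each barcode into a much more abundant single-mismatch neighbor."""
    cleaned = Counter()
    parents = []
    for barcode, count in pileups.most_common():  # most abundant first
        parent = next(
            (p for p in parents
             if len(p) == len(barcode)
             and hamming_distance(p, barcode) == 1
             and pileups[p] >= min_ratio * count),
            None,
        )
        if parent is None:
            parents.append(barcode)
            cleaned[barcode] += count
        else:
            cleaned[parent] += count
    return cleaned


# Hypothetical tallies, e.g. Counter(random_barcode for each retained read pair)
pileups = Counter({'AAAAA': 1000, 'AAAAT': 3, 'CCCCC': 500, 'CCCCG': 400})
cleaned = fold_neighbors(pileups)
print(cleaned)
```

Here 'AAAAT' is folded into 'AAAAA', while 'CCCCG' is kept as its own barcode because it is nearly as abundant as 'CCCCC'. The pairwise comparison is quadratic in the number of barcodes, so a dedicated clusterer is the better choice for large libraries.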
