This tool identifies depletion guide candidates (sgRNAs) targeted to specific genes.
This is a generalization of https://github.com/gprezza/DASH_rRNA_depletion that supports any species and any target location(s). DASH is typically used for species-specific rRNA depletion, but it can be used to target any gene(s).
Provided a FASTA file, a GTF file of region(s) to deplete, a PAM sequence
(e.g. "NGG" for wildtype Cas9) and a guide length, CandidateFinder will:
- identify all possible PAM sequences
- search for all PAM sites on both strands of the regions to deplete
- return candidate sgRNAs of the specified length if their PAM site is sufficiently far from the end of the region (the sgRNA is immediately 5' of the PAM site)
- filter candidates based on the thresholds for GC content, presence of a heterodimer with the primer, and high end stability with the primer
- discard off-target candidates. A candidate is off-target if it maps to a location outside of the target regions and that location is adjacent to a PAM site at its 3' end
- compose the full oligo sequence of the candidates
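The PAM expansion, two-strand search, and GC filter described above can be sketched in plain Python. This is a minimal illustration, not the tool's actual implementation: the function names (`expand_pam`, `find_guides`, and so on) are hypothetical, the region is assumed to be an uppercase-able DNA string already extracted from the FASTA, and the primer heterodimer, end-stability, and off-target filters are left out.

```python
from itertools import product

# IUPAC nucleotide codes mapped to the concrete bases they stand for.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def expand_pam(pam):
    """Return every concrete sequence matched by an IUPAC PAM (NGG -> AGG, CGG, GGG, TGG)."""
    return ["".join(p) for p in product(*(IUPAC[b] for b in pam.upper()))]


def reverse_complement(seq):
    """Reverse complement of an A/C/G/T string."""
    return seq.translate(COMPLEMENT)[::-1]


def gc_percent(seq):
    """Percentage of G and C bases in seq."""
    return 100.0 * sum(b in "GC" for b in seq) / len(seq)


def find_guides(region, pam, guide_length, min_gc=0, max_gc=100):
    """Yield (strand, pam_start, guide) for each PAM site with room for a full guide.

    pam_start is a 0-based offset on the strand being scanned, so "-" strand
    offsets refer to the reverse-complemented region.
    """
    if guide_length <= 0:
        raise ValueError("guide_length must be positive")
    pams = set(expand_pam(pam))
    k = len(pam)
    region = region.upper()
    for strand, seq in (("+", region), ("-", reverse_complement(region))):
        # Start at guide_length so a full guide fits 5' of the PAM.
        for i in range(guide_length, len(seq) - k + 1):
            if seq[i:i + k] in pams:
                guide = seq[i - guide_length:i]
                # GC thresholds are inclusive, as in the parameter list.
                if min_gc <= gc_percent(guide) <= max_gc:
                    yield strand, i, guide
```

Scanning the reverse complement as a second "+"-like sequence keeps the guide-extraction logic identical for both strands, at the cost of having to convert minus-strand offsets back to reference coordinates afterwards.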
Prerequisites:
- conda must be installed
- this code repository must be downloaded and unpacked. Below, we refer to the unpacked directory as $WORKDIR. Unless otherwise specified, paths provided are assumed to be relative to $WORKDIR.
Create the conda environment in the top level of $WORKDIR:

    conda create -y -p env --file requirements.txt

In the top level of $WORKDIR:

    conda activate ./env
    # install in editable mode
    pip install -e .

Then run tests:

    pytest -vv test/test.py

To run just a single test, use the -k flag, e.g.,

    pytest -vv test/test.py -k test_orig_small

CandidateFinder(fasta, annotation, pam, guide_length)
Required input parameters:
- fasta : str
Path to genome fasta file (required)
- annotations : str or pybedtools.BedTool
If str, should be path to BED or GTF file that will be converted to
pybedtools.BedTool file. This conversion will automatically handle
1-based GTF and 0-based BED annotations to avoid off-by-one errors (required)
- pam : str
PAM sequence, e.g. "NGG" for wildtype Cas9 (required)
- guide_length : int
Length of the guides to select (required)
Optional parameters:
- min_gc: int
Lower threshold of acceptable GC percentage (inclusive) (optional; default=0)
- max_gc : int
Upper threshold of acceptable GC percentage (inclusive) (optional; default=100)
- het_tm : int
Degrees C threshold for Primer3's heterodimer routine (optional; default=40)
- end_tm : int
Degrees C threshold for Primer3's end stability routine (optional; default=30)
- t7 : str
T7 promoter (minus the first G) (optional; default="TTCTAATACGACTCACTATA")
- scaffold : str
Cas9 sgRNA scaffold (optional;
default="GTTTTAGAGCTAGAAATAGCAAGTTAAAATAAGGCTAGTCCGTTATCAACTTGAAAAAGTGGCACCGAGTCGGTGCTTTTTT")
- primer : str
Reverse complement of this Cas9 primer will be added to template sgRNA oligos (optional;
default="GGCATACTCTGCGACATCGT")
- bowtie_index_prefix : str
Path to directory containing existing bowtie2 index. If missing,
index will be created (optional; default='tmp/index')
- outfn : str
Path and basename for the output files (optional; default='candidates')
- plots : bool
Generate coverage plots (optional; default=True)
- legacy : bool
Match output to original DASH tool (optional; default=False)

The output of CandidateFinder will be files named after the outfn parameter, which we will refer to as $OUTFN. This can be a file prefix (e.g. candidates), or a path followed by a file prefix (e.g. output_dir/candidates).
- $OUTFN_oligos.csv: comma-separated file of unique full oligo sequences. They are built from the T7 promoter sequence, the candidate guide, the Cas9 sgRNA scaffold and the reverse complement of the primer.
- $OUTFN_grnas.bed: BED file of where the sgRNAs mapped to the reference genome. An sgRNA mapping to several positions in the genome will be listed at each of its locations.
- $OUTFN_grnas.fa: FASTA sequence of the sgRNAs only.
- $OUTFN_coverage.pdf: map of the sgRNAs along the reference genome.
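To make the oligo composition concrete, the following sketch joins the four parts in the order listed for $OUTFN_oligos.csv, using the default `t7`, `scaffold`, and `primer` values from the parameter list. The helper names `build_oligo` and `unique_oligos` are hypothetical, and simple end-to-end concatenation is an assumption: the tool itself may handle the junctions differently (for example, around the first G of the T7 promoter).

```python
# Default values copied from the parameter list above.
T7 = "TTCTAATACGACTCACTATA"
SCAFFOLD = (
    "GTTTTAGAGCTAGAAATAGCAAGTTAAAATAAGGCTAGTCCGTTATCAACTTGAAAAAGTGGCACCGAGTCGGTGCTTTTTT"
)
PRIMER = "GGCATACTCTGCGACATCGT"

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Reverse complement of an A/C/G/T string."""
    return seq.translate(COMPLEMENT)[::-1]


def build_oligo(guide, t7=T7, scaffold=SCAFFOLD, primer=PRIMER):
    """Assumed layout: T7 promoter + guide + scaffold + reverse complement of primer."""
    return t7 + guide + scaffold + reverse_complement(primer)


def unique_oligos(guides):
    """Build one oligo per guide and drop duplicates, keeping first-seen order."""
    return list(dict.fromkeys(build_oligo(g) for g in guides))
```

For example, `unique_oligos(["ACGT" * 5, "ACGT" * 5])` yields a single oligo, which mirrors the statement that the CSV lists unique sequences even when a guide occurs at several genomic positions.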