Pipeline to design one FISH probeset for each provided input. Three input types are allowed:
- Gene annotations in gtf / gtf.gz format
- Genomic regions in bed / bed.gz format
- Nucleotide sequences in fasta / fasta.gz format
The GTF-based workflow takes a GTF annotation file to retrieve coordinates and nucleotide sequences of each gene, transcript and exon. In this workflow, all exons belonging to the same transcript isoform are merged together (intronic regions are dropped) to form one concatenated sequence featuring exon-exon junctions, which is used to design a certain number of kmer oligos to be used in RNA FISH experiments. The BED-based workflow can be used to test entire ungapped regions based on their coordinates. The FASTA-based workflow can be used to test nucleotide sequences, being therefore useful in situations where coordinates or identifiers are not available.
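The exon-merging step of the GTF-based workflow can be pictured with a minimal sketch. This is not the pipeline's own code: `splice_exons` is a hypothetical helper, the coordinates are assumed to be 0-based and half-open (BED-style) relative to the given sequence, and only plus-strand transcripts are handled.

```shell
# Concatenate the exon sub-sequences of one transcript, dropping introns.
# Usage: splice_exons SEQUENCE "start:end start:end ..."
# Assumption: 0-based, half-open coordinates listed in transcript order.
splice_exons() {
    printf '%s\n' "$2" | awk -v seq="$1" '{
        out = ""
        for (i = 1; i <= NF; i++) {
            split($i, c, ":")
            # awk substr() is 1-based, hence the +1 on the start
            out = out substr(seq, c[1] + 1, c[2] - c[1])
        }
        print out
    }'
}

# Toy locus: exon (AAAA), intron (ggg), exon (CCCC), intron (tt), exon (GG)
splice_exons "AAAAgggCCCCttGG" "0:4 7:11 13:15"   # prints AAAACCCCGG
```

The output contains two exon-exon junctions (AAAA|CCCC and CCCC|GG), which is where junction-spanning oligos would come from.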
A Singularity image can be provided on request. Otherwise, the pipeline can be installed using this Dockerfile to build a Docker container and convert it to a Singularity image. Follow the guide below to install everything. Both Singularity and SLURM are required to run the pipeline.
- Download Files:
    git clone https://github.com/BiCroLab/fish_probe_design.git
    cd fish_probe_design/installation/
    unzip prbdocker-master.zip && cd prbdocker-master
- Create Docker Container:
    docker build -t prbdocker .
- Convert Docker to Singularity:
    docker run -v /var/run/docker.sock:/var/run/docker.sock -v ".":/output \
    --privileged -t --rm singularityware/docker2singularity:v2.6 prbdocker
- First, clone the repository:
    git clone https://github.com/BiCroLab/fish_probe_design.git
- Adjust all user-specific variables in prb.config
- Launch the whole pipeline with:
    bash main.sh
The pipeline consists of a main.sh script that manages a series of modules.
All variables can be controlled and edited from a prb.config text file:
- ${INPUT_GTF} annotation file in .gtf / .gtf.gz format.
- ${INPUT_FASTA} annotation file in .fasta / .fasta.gz format.
- ${INPUT_BED} annotation file in .bed / .bed.gz format.
- ${GENOME} path to genome .fa / .fa.gz having .fai / .gzi index.
  All chromosome names should start with the prefix chr and have no additional spaces.
  Required index files can be produced with samtools faidx.
- ${BASEDIR} / ${WORKDIR} base path and output directory name.
- ${OLIGO_LENGTH} length of probe oligos (default is 40).
- ${OLIGO_SUBLENGTH} sublength of probe oligos (default is 21).
- ${SPACER} value affecting average oligo density (default is 10bp).
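Before launching, the chromosome-naming requirement for ${GENOME} can be checked with a quick sketch like the one below. `bad_fasta_headers` is a hypothetical helper, not part of the pipeline, and the file names in the usage comments are placeholders.

```shell
# Print FASTA header lines that do not start with "chr"
# or that contain whitespace after the sequence name.
# Reads FASTA text on stdin; prints nothing if all headers look fine.
bad_fasta_headers() {
    grep '^>' | grep -v -E '^>chr[^[:space:]]*$'
}

# Usage (placeholder paths):
#   bad_fasta_headers < genome.fa
#   gzip -dc genome.fa.gz | bad_fasta_headers
```

Any line printed by the function points to a header that should probably be renamed before indexing the genome with samtools faidx.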
For each input, N represents the maximum number of oligos to be found. This number corresponds to
${WIDTH} / (${OLIGO_LENGTH} + ${SPACER}), where ${WIDTH} is the length of the input region in bp. If N suitable candidates are not found, the pipeline will reduce N and retry. For example: a 5000bp region / (40bp oligos + 10bp spacer) could yield up to a maximum of 100 oligos.
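The upper bound N can be estimated ahead of time with a one-line helper. This is only a sketch: `max_oligos` is a hypothetical name, and rounding down with integer division is an assumption about how the pipeline treats a partial trailing slot.

```shell
# Upper bound on the number of oligos for a region.
# Usage: max_oligos WIDTH OLIGO_LENGTH SPACER   (all in bp)
# Assumption: integer division, so a partial trailing slot is not counted.
max_oligos() {
    echo $(( $1 / ($2 + $3) ))
}

max_oligos 5000 40 10   # prints 100
```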
All results will be saved in a single ${WORKDIR}/prb_results directory, with one .tsv file for each provided input. Output filenames also include the final number of oligos found, which is less than or equal to the maximum N value. At this stage, users might want to check whether the number of oligos found dropped significantly with respect to the original N, and whether the oligos are evenly distributed throughout the sequence. The pw score indicates the overall quality of the entire probeset and, if possible, users should try to avoid very low values. However, very short input regions might consistently yield low values. In most situations, provided that there are enough oligos to get a detectable fluorescence signal, users can safely ignore this parameter. General suggestions: (1) try to fit as many oligos as possible in a region to get a stronger signal; (2) avoid excessively gapped probesets, as they could form separate dots.
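One way to spot an excessively gapped probeset is to look at the largest distance between consecutive oligos. The sketch below is not part of the pipeline: `oligo_gaps` is a hypothetical helper that expects one oligo per line as "start<TAB>end" on stdin. The column layout of the .tsv result files is not described here, so extracting those two columns (for example with cut) is left as an assumption to adapt.

```shell
# Report the oligo count and the largest gap between consecutive oligos.
# Input (stdin): one oligo per line, "start<TAB>end", in any order.
oligo_gaps() {
    sort -k1,1n | awk -F '\t' '
        NR > 1 { gap = $1 - prev_end; if (gap > max) max = gap }
        { prev_end = $2 }
        END { printf "oligos=%d max_gap=%d\n", NR, max }'
}

# Three 40bp oligos; the last one sits far away from the other two.
printf '0\t40\n50\t90\n400\t440\n' | oligo_gaps   # prints oligos=3 max_gap=310
```

A max_gap much larger than ${SPACER} suggests that part of the region yielded no suitable oligos.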
Since most companies that synthesize oligos apply big discounts when ordering several sequences at once, it is recommended to group multiple probesets together in one or a few oligo-pools. All oligos of a given probeset will be further modified to attach two flanking sequences, called flaps (see figure), which can be used to bind fluorophores as well as to selectively amplify the whole probeset from an oligo-pool mixture. The flap sequences used should not hybridize with the target genome, to prevent off-target binding and interference. We provide a series of scripts that can be used to calculate orthogonal kmers for flaps. It is necessary to compute these sequences only once for each reference genome. Combining different left and right flap sequences can theoretically yield a large number of combinations from a relatively small number of orthogonal sequences. Some pre-computed 20-mers are available for the human genome and can be provided on request.
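To see why a small set of orthogonal sequences goes a long way, the left/right combinations can simply be enumerated. The sketch below uses a hypothetical helper, `flap_pairs`, and placeholder flap names (L1, R1, ...) instead of real sequences.

```shell
# Print every left/right flap combination, one tab-separated pair per line.
# Usage: flap_pairs "LEFT1 LEFT2 ..." "RIGHT1 RIGHT2 ..."
flap_pairs() {
    for l in $1; do
        for r in $2; do
            printf '%s\t%s\n' "$l" "$r"
        done
    done
}

# 3 left flaps x 2 right flaps = 6 distinct pairs
flap_pairs "L1 L2 L3" "R1 R2" | wc -l
```

With, say, 10 left and 10 right flaps, the same enumeration would give 100 distinct pairs, so up to 100 probesets could each receive a unique pair.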
This section explains how to integrate the previous information and prepare a final table that can be supplied to companies for oligo synthesis. A semi-manual approach is recommended here. Assuming that users are interested in visualizing several probes in the same experiment while using a limited number of channels, they might want to consider which fluorescent color will be assigned to each probeset. In this situation, for probes of the same group or condition, it is advised to assign a common sequence for one of the two flap sequences. Although this is not essential, it can simplify and speed up pipetting for amplification, and also reduce the chances that the wrong fluorophores get attached to some oligos. This strategy can be ignored completely if users are interested in a relatively low number of regions and have an excess of orthogonal sequences to create unique combinations. We provide an example script that integrates flaps and oligo sequences and creates an output Excel file that can be used for ordering probes.
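The integration step can be sketched as a small table join. This is not the provided example script: `build_order_table` is a hypothetical helper, both input layouts are assumptions, and the output is a plain tab-separated table rather than an Excel file. The flap table is written by hand (the semi-manual part), giving probesets of the same color group the same left flap.

```shell
# Join a hand-written flap table with designed oligos.
# Arg 1: flap table,  "probeset<TAB>left_flap<TAB>right_flap"
# Arg 2: oligo table, "probeset<TAB>oligo_sequence"
# Output: "probeset_index<TAB>left_flap+oligo+right_flap"
# Oligos whose probeset is missing from the flap table are skipped.
build_order_table() {
    awk -F '\t' -v OFS='\t' '
        NR == FNR { left[$1] = $2; right[$1] = $3; next }
        ($1 in left) {
            n[$1]++
            print $1 "_" n[$1], left[$1] $2 right[$1]
        }' "$1" "$2"
}

# Usage (placeholder file names):
#   build_order_table flaps.tsv oligos.tsv > order.tsv
```

Whether the oligo should be inserted as designed or reverse-complemented before adding flaps depends on the experimental setup and is not decided by this sketch.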
Check here for further information.