The purpose of this pipeline is to analyze TnSeq data. TnSeq data usually comes in the form of single-end short reads. The reads are trimmed and mapped onto a reference genome, and "TA" sites are identified across the whole reference genome. A series of steps then compiles the number of reads mapped (counts) at TA sites across the genome, and narrows them down to the segments of the genome encoded by genes. The final output is a table of genes with the number of TA-site insertions across replicates.
To run this workflow, you will need:
- Single-end sequences, one `{sample}_R1_001.fastq` file per sample. You may use the data you receive directly from the Sequencing Center.
- A reference genome: a `fasta` file and a `gff` file containing the coordinates of the genomic features. You can usually download the fasta and gff files directly from NCBI.
This repository contains scripts to analyze TnSeq data on the UW-Madison Center for High Throughput Computing (CHTC), using HTCondor and DAGMan to automate job runs.
This repository contains a folder named `scripts` with a series of paired `.sh` and submit (`.sub`) files, containing the commands to be run and the resources needed to run each job, respectively.
Helper scripts (`.sh`, `.py`) are available to set up the project structure or summarize all the data at the end.
The pipeline is designed to be fully reproducible from start to finish through the use of software containers, executable scripts, and various helper scripts. It is also designed so that it can be rerun for new projects, for example if new sequencing files are generated or the reference genome changes.
- `set_up.sh`: a bash script that takes your netID and a project name as arguments. It creates all the input and output folder structure needed for this pipeline.
- `make_custom_dag.sh`: a bash script that creates a DAG for each sample.
- `create_main_dag.sh`: a bash script that creates a "super DAG" based off the `.dag` files you created using `make_custom_dag.sh`.
- `01-cutadapt.sub`: cutadapt step that trims the adapter off of the FASTQ file. For example, `mysequenceADAPTERsomethingelse` will keep `mysequence` only.
- `02-index_ref.sub`: builds the genome index for the reference genome, to prepare for mapping.
- `03-map-to-ref.sub`: maps a sample to the reference genome.
- `04-sort-and-index-bam.sub`: takes the aligned reads, then sorts and indexes the file.
- `05-find-TA.sub`: finds all instances of "TA" in the reference genome (`count-TA.py`: helper script).
- `06-extract-insertion-sites.sub`: from the mapped file, gets all the reads that overlap a "TA" site in the reference genome.
- `07-count-insertions-per-TA-sites.sub`: using the file from step 6, sums up how many reads (insertions) occur at each TA site.
- `08-gff-to-bed.sub`: converts the GFF annotation file of the reference genome into a `bed` format file (`gff-to-bed.py`: helper script).
- `09-count-insertions-per-TA-sites-with-genes.sub`: compiles the number of insertions (reads mapping to TA sites) among the genes (features) of the reference genome.
- `merge_files.sh`: takes the step `09` output file from each sample and creates one single table. Each column is a sample.

A rough sketch of the kinds of commands behind the numbered steps is shown below.
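This sketch is only a hypothetical illustration of the tools involved (cutadapt, Bowtie 1, SAMtools, BEDTools); the exact flags, container images, and intermediate file names are defined in the corresponding `.sh`/`.sub` files and may differ.

```bash
# Hypothetical sketch only -- see the .sh/.sub files for the real commands.

# 01: trim the adapter (everything from the adapter onward is removed)
cutadapt -a ADAPTER -o sample_trimmed.fastq sample_R1_001.fastq.gz

# 02: build a Bowtie (v1) index for the reference genome
bowtie-build reference.fasta reference_index

# 03: map the trimmed reads to the reference, writing SAM output
bowtie -S reference_index sample_trimmed.fastq > sample.sam

# 04: sort and index the alignments
samtools sort -o sample.sorted.bam sample.sam
samtools index sample.sorted.bam

# 06-09 style intersection: which TA sites fall within annotated features
# (the TA-site detection itself is handled by count-TA.py in step 05)
bedtools intersect -a reference.bed -b TA_sites_with_counts.bed -wa -wb > per_feature_counts.txt
```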
The output of merge_files.sh should look like this: a file with at least 6 columns:
- Accession ID
- Start
- End
- Gene name
- Score placeholder (a `.` in this output)
- Orientation (strand)
- followed by one count column per sample analysed.
NC_000964.3 409 1750 dnaA . + 96 76 47 64 55 49
NC_000964.3 1938 3075 dnaN . + 7 11 13 6 4 9
NC_000964.3 3205 3421 rlbA . + 0 0 0 1 1 0
NC_000964.3 3436 4549 recF . + 6 5 8 4 10 4
NC_000964.3 4566 4812 remB . + 170 48 49 2 32 17
NC_000964.3 4866 6783 gyrB . + 33 53 34 15 24 27
NC_000964.3 6993 9459 gyrA . + 25 18 10 9 30 16
NC_000964.3 9809 11364 rrnO-16S . + 220 137 232 72 89 7
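If you want to eyeball the merged table with columns aligned, a standard (pipeline-independent) trick is to pipe it through `column`; the file name here is just the example project's merged output produced in the last step below:

```bash
# View the merged table with aligned columns; press q to quit
column -t BothSets_merged_output.txt | less -S
```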
- Log into the server
ssh netid@ap2002.chtc.wisc.edu
# enter password
# make a copy of this github repository
git clone https://github.com/patriciatran/tnseq.git
- Navigate to the tnseq folder, and enter the scripts folder
cd tnseq/scripts
ls
- Create the folder structure for the input and output files. The script `set_up.sh` needs two arguments: your `netID` and your project name.
bash set_up.sh bbadger Set1
In the example above, bbadger is the netID and Set1 is the project name.
This will create the following folder structure:
/staging/bbadger/Set1
/staging/bbadger/Set1/data
/staging/bbadger/Set1/cutadapt
/staging/bbadger/Set1/aligned
/staging/bbadger/Set1/gff
/staging/bbadger/Set1/insertions
/staging/bbadger/Set1/alignment
- You will need 3 input file types:
  - Single-end reads for each sample (e.g. `sample_R1_001.fastq.gz`)
  - The reference genome; this can be a reference genome from NCBI (`reference.fasta`)
  - A GFF annotation, a gene feature file that contains all the annotations of the reference genome. This can also be downloaded from NCBI (`reference.gff`). Make sure that the file name matches that of the fasta file, but just has a different file extension. For example, `Bacillus168.fasta` and `Bacillus168.gff`.
To transfer files from your computer onto the Staging folder, do the following.
Assuming that you have a folder on your laptop:
~/Downloads/data/sample1_R1_001.fastq.gz
~/Downloads/data/sample2_R1_001.fastq.gz
~/Downloads/data/sample3_R1_001.fastq.gz
~/Downloads/data/Bacillus168.fasta
~/Downloads/data/Bacillus168.gff
You can transfer the whole folder as follows. Open a new Terminal tab:
cd ~/Downloads
scp -r data/* netid@ap2002.chtc.wisc.edu:/staging/netid/Set1/data/.
The `scp` command with the `-r` flag lets you transfer a whole folder.
The first argument, `data/*`, is the content to be transferred. This is followed by a space, then your netID login on the server, followed by `:` and the path you want to transfer these files to. In this case, we want to transfer them into the `data` subfolder that we created in Step 3.
Switch back to the Terminal tab where you are logged into the server, and make sure the files have been transferred correctly by listing the folder contents.
ls -lht /staging/bbadger/Set1/data
- Next, we will create the DAG that defines the steps of our workflow. We will use the script `create_custom_dag.sh` to automatically create this DAG file.
The create_custom_dag.sh script takes many arguments, in a very specific order.
bash create_custom_dag.sh <samples_list> <netid> <adapter> <feature> <ref> <project>
- `samples_list`: the list of samples found in `/staging/netid/project/data/`, without the `_R1_001.fastq.gz` suffix.
- `netid`: your NetID.
- `adapter`: the adapter sequence, specific to the TnSeq experiment. No quotation marks, just the sequence.
- `feature`: the feature of the GFF file to extract when counting the sum of TA insertions in those features. Common ones are `gene`, `pseudogene`, etc. To see all the options, open the GFF file and look at the 3rd column.
- `ref`: the name of the reference genome without the extension, i.e. the shared base name of `reference.fasta` and `reference.gff`.
- `project`: the same project name used with the `set_up.sh` script.
This will create one file per sample, named `tnseq_SAMPLE.dag`.
Here is an example of a command with arguments:
bash create_custom_dag.sh samples.txt bbadger ACAGGTTGGATGA gene Bacillus168 BothSets
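For reference, the `samples.txt` used here would simply list one sample name per line (without the `_R1_001.fastq.gz` suffix), matching the FASTQ files you transferred; with the hypothetical files above it would look like:

```
sample1
sample2
sample3
```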
- Create a main DAG using `create_main_dag.sh`. This only takes two arguments: your `samples.txt` list and an output file name for the final DAG.
bash create_main_dag.sh samples.txt BothSets.dag
This will create a file named BothSets.dag
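The generated super DAG is not shown in this README, but a common DAGMan pattern for chaining per-sample DAGs is `SUBDAG EXTERNAL`, so `BothSets.dag` plausibly contains entries along these lines (sketch only, with hypothetical sample names; inspect the generated file for the real contents):

```
SUBDAG EXTERNAL sample1 tnseq_sample1.dag
SUBDAG EXTERNAL sample2 tnseq_sample2.dag
SUBDAG EXTERNAL sample3 tnseq_sample3.dag
```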
- Submit the workflow.
condor_submit_dag BothSets.dag
Wait for results to populate the /staging/netid/project/ folder. You can close the terminal and come back - your jobs will automatically be submitted and run.
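While you wait, you can monitor progress with the standard HTCondor tools (nothing pipeline-specific here):

```bash
# List your jobs currently in the queue (replace bbadger with your netID)
condor_q bbadger

# DAGMan writes a log next to the DAG file with node-by-node progress
tail BothSets.dag.dagman.out
```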
- Summarize the results. A script named `merge_files.sh` is available to help you summarize your results. This script takes 2 arguments: your `netid` and your `project` name.
bash merge_files.sh bbadger BothSets
This will create a file named BothSets_merged_output.txt containing the sum of insertions per feature across all the samples in the given project.
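As a quick sanity check, the merged table should have 6 fixed columns plus one count column per sample; plain `awk` (not part of the pipeline) can confirm this, assuming the file is in your current directory:

```bash
# Print the number of columns in the first row; expect 6 + number of samples
awk '{ print NF; exit }' BothSets_merged_output.txt
```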
Note that the values produced by this pipeline are raw counts of insertions per feature (e.g. genes).
Common next steps are to normalize the data by the total number of reads or by gene length, and to determine which genes are considered Essential or Non-Essential across samples. Genes with few or no transposon insertions are usually considered Essential: conversely, if a transposon lands in a gene and disrupts its function yet the bacterium still grows normally, that gene was not essential for the system being studied.
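As a minimal sketch of the length-normalization idea, assuming the merged table layout shown above (counts starting in column 7), you could compute insertions per kilobase of feature for the first sample with plain `awk`; this is illustration only, not part of the pipeline:

```bash
# Columns assumed: 1=accession, 2=start, 3=end, 4=gene, 5=score, 6=strand, 7+=per-sample counts
awk '{ len_kb = ($3 - $2) / 1000; printf "%s\t%.2f\n", $4, $7 / len_kb }' BothSets_merged_output.txt
```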
This workflow uses:

- Cutadapt v.5.1
- Bowtie version 1 (v.1.3.1)
  - Langmead, B., Trapnell, C., Pop, M. et al. Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biol 10, R25 (2009). https://doi.org/10.1186/gb-2009-10-3-r25
- Samtools
  - Li H, Handsaker B, Wysoker A, Fennell T, Ruan J, Homer N, Marth G, Abecasis G, Durbin R; 1000 Genome Project Data Processing Subgroup. The Sequence Alignment/Map format and SAMtools. Bioinformatics. 2009 Aug 15;25(16):2078-9. doi: 10.1093/bioinformatics/btp352. Epub 2009 Jun 8. PMID: 19505943; PMCID: PMC2723002.
- BEDTools (v.2.27.1)
  - Quinlan AR, Hall IM. BEDTools: a flexible suite of utilities for comparing genomic features. Bioinformatics. 2010 Mar 15;26(6):841-2. doi: 10.1093/bioinformatics/btq033. Epub 2010 Jan 28. PMID: 20110278; PMCID: PMC2832824.
- Python
This code and these instructions were written by Patricia Q. Tran for the project of Arunima Kalita.
For any questions, please contact ptran5 [at] wisc [dot] edu
