Please note that this step is only neccessary if you plan to edit scripts and have access to this repository as a collaborator.
- Create a GitHub account here.
- Access the HPCC by going to https://ondemand.hpcc.msu.edu and clicking
dev-intel-16under theDevelopment Nodesdrop-down, which will open a terminal on a new tab. - In the terminal follow the instructions under "Generating a new SSH key" to create an SSH key here.
- Now that you have the key created you will need to run the
cat ~/.ssh/id_ed25519.puband copy the contents it displays (it should look like “ssh-ed25519 AAA …… email@website.com”). - Log in to GitHub and click the icon in the right hand corner and select “Settings”. Click the section labeled "SSH and GPG keys". Next, click "New SSH Key" paste the contents copied above into the box provided, and give it a good title (maybe your computer name) and an expiration date if desired.
The following should be done on the Lab MacBook:
-
Download the zipped data folder to the Desktop and unzip it. Verify that it contains two subfolders called
AandB, whereAcontains the sequencing files for the MUT reads, andBcontains the sequencing files for the WT reads. Each file should have a format likeV300098986_L03_PLAujbeR032370-663_1.fq.gz, where theL03indicates that the file is from sequencing lane 3, the_1means that it is theR1read, and thefq.gzimplies that it is a zipped fastq file. The current sequencing company splits reads across 3 lanes and these three files must be joined together in increasing order by the lane. For example, if the lanes areL02,L03, andL04, we must joinL02->L03->L04such thatL02is the beginning andL04is the end. The following script will ideally concatenate the three separate files for each fo the four read types: MUT R1, MUT R2, WT R1, and WT R2. -
Once we check that the data is formatted correctly, upload the data folder to the HPCC. This can be done in two ways:
- Online GUI: Go to https://ondemand.hpcc.msu.edu and navigate to
Files->Home Directory. Here, you can navigate through the GUI to the correct folder and hit the blueUploadbutton to upload folders or files from the local computer. - Terminal: Open terminal on the MacBook and run the following. This will prompt you to enter a password for your HPCC account (same as MSU account).
scp -r ~/Desktop/DataFolder <username>@hpcc.msu.edu:<DataFolder> # replace <>'s with your MSU username and the name of your data folder
- Online GUI: Go to https://ondemand.hpcc.msu.edu and navigate to
-
Access the HPCC terminal. This can also be done in two ways:
- HPCC Web Terminal:: Go to https://ondemand.hpcc.msu.edu and click
Development Nodes->dev-intel-16to open a terminal on a new tab. - Local Terminal: Open terminal on the MacBook and run the following. This will prompt you to enter a password for your HPCC account (same as MSU account).
ssh <username>@hpcc.msu.edu # replace <> with your MSU username
- HPCC Web Terminal:: Go to https://ondemand.hpcc.msu.edu and click
-
Set up an analysis directory adjacent to the data folder. This can be done in two ways:
- Manually: Download this GitHub repository to the MacBook as a zipped folder and unzip it. Next, rename the folder with the name of suppresor mutant. Then, upload the folder to the HPCC so that it is in the folder containing the data folder.
-
- Terminal, after manual download: Open terminal on the MacBook and run the following. This will prompt you to enter a password for your HPCC account (same as MSU account).
scp -r ~/Desktop/DataFolder <username>@hpcc.msu.edu:<DataFolder> # replace <>'s with your MSU username and the name of your data folder
- HPCC Web Terminal, straight from GitHub: Navigate to the HPCC's web terminal and run the following code. (This may require SSH keys to be set-up on the HPCC).
cd ~ git clone git@github.com:yashmanne/Benning_Simple.git mv Benning_Simple <analysis_folder> # replace {analysis_folder} with {your_sample_name}
Once the analysis folder is set-up, open the variables.sh file on the Web GUI by clicking the three dots next to the file and hitting EDIT to open the file on a new tab. The script should look as follows:
```bash
#### Change these to what yours are
# data folder
dataLocation="../DefaultData"
# output file
lineName="defaultEMS"
#### Note that in the example case, data was spread across 3 lanes L02, L03, L04. This may not always be the case and might need to be modified
lane1="L02"
lane2="L03"
lane3="L04"
```
Now, change the dataLocation variable to "../DataFolder", where "DataFolder" is the name of your data folder. Next, change lineName to "your_sample_name", where "your_sample_name" is the name of your sample. Next, change lane1 to the lowest lane value, in this case "L02" and lane2 to the next lowest value, and lane3 to the highest lane value. Save the script by hitting the blue SAVE button.
Once this file is saved, we can run subsequent HPCC scripts!
Navigate to the folder you uploaded from this GitHub to the HPCC (what you just named after your mutant):
cd ./<folder_name>/ To run the analysis, simply run the following in terminal:
bash runScripts.sh This will run all scripts listed below in the following order. These scripts can be found in the scripts folder of this repository.
import_data.sb: gathers all data into theinputfolder into a compatible file names.getReferences_Arabidopsis.sb: downloads necessary reference genome information.runBwaMem2Index_slurm_ArabidopsisEMS.sb: indexes the reference genome for comparison with sequencing files.runBwaMem2Aln_slurm_ArabidopsisEMS.sb: aligns the sequencing files to the reference genome and generates.samfiles that contains the alignment data.runSamtoolsSamToBam_slurm_ArabidopsisEMS.sb: convert alignment data into more efficient.bamformat.runSamtoolsBam_Sort_IndexEMS.sb: sorts and indexes the alignments for future SNP-calling.runSamtoolsMarkDuplicatesEMS.sb: removes all duplicates to ensure efficient and accurate SNP-calling.
Each .sb script will generate a slurm-########.out file that shows the log of commands run in each script. These files are useful to debug any issues that may pop up. In most cases, it will be easy to debug, if not, the HPCC folks can be contacted here.
Each script can be edited if needed by going to https://ondemand.hpcc.msu.edu/, and clicking Home Directory under the Files drop-down. Then, navigate to the desired script and hit EDIT under the three dots drop-down for the file you want to edit. A new tab will allow you to edit the file and you can save the file by hitting the SAVE button on the top left.
All scripts are finished when all 7 slurm-########.out files show up in the scripts folder. If all scripts are successful, there should be multiple non-empty files in the output folder.
A common issue while running the script may be that 1 or more of the data files are corrupted. In this case, the corrupted portion of the corrupted data files must be cut out. In these cases, it's common to manually contatetate three files of each files and put into the input folder. Since data has already been added to the input, script 1 (import_data.sb) can be skipped and the subsequent scripts can be run by running the following:
```bash
bash runScripts.sh data
```
Additionally, the quality of the sequencing files can be tested by running the runFastqc_slurm_EMS.sb script, which outputs HTML files containing plots of different metrics of data quality. This can be run separately by doing the following:
```bash
# navigate to scripts folder
cd ./scripts/
sbatch runFastqc_slurm_EMS.sb
```
Once the script is run, its progress can be checked by the following:
```bash
squeue -lu <username> # replace <> with your MSU username
```
Once all the output files have been successfully generated, copy the output files from the HPCC to the Lab Macbook by doing the following:
- Open up terminal.
- Navigate to desktop by doing
cd ~/Desktop/ scp -r <username>@hpcc.msu.edu:<analysis_folder_name>/output . # replace <>'s with your MSU username and name of analysis folder- Run other lab experiments while files download. (It can take close to an hour)
-
Once all the files have transferred, download Simple to your desktop and unzip. Rename the folder as
Simpleinstead ofSimple-master. Go to thesimple_variables.shfile underSimple/scripts/and change thelinevariable from “EMS” to your desired sample name as done for thelineNamevariable above. -
Next, copy all files in the output folder to the the
Simple/outputfolder. -
Now, in
Simple/scripts/, replace thesimple.shwith thesimple.shfile present in thescriptsfolder of the data analysis repository. The newsimple.shfile can be downloaded using the HPCC GUI as done above.
-
Now, SIMPLE is ready to run by doing the following on the MacBook:
cd ~/Desktop/Simple chmod +x ./scripts/simple.sh ./scripts/simple.sh
Come back in a day and it should be ready. Congrats, now you get to do the fun stuff!