Global database creation using OBITools v.1.2.12 on Brown's high-performance cluster, OSCAR.
This repository can be used to download a global reference library from the European Nucleotide Archive (ENA) (i.e., all available sequences in a given division).
Note: This step only needs to be run if an updated version of a global reference library is required for downstream steps.
The schematic below shows the entire bioinformatic pipeline for DNA metabarcoding data, but the step included in this repository is shown in the white box.
Post-processing steps use the Python 2.7 version of OBITools to trim the database down to sequences matching the lab's gene/primer use cases.
The European Nucleotide Archive (ENA) hosts a Globus endpoint for file transfers. Globus is far faster than other options and is the best way to download the DAT files for your taxonomic division(s) of interest to your scratch directory. In the example for plants (PLN), all of the available sequences amount to over 500 GB, which need to be stored in scratch.
See the Oscar docs to set up your account with Globus.
In the GUI, choose the EMBL-EBI Public Data Collection and use this path: /pub/databases/ena/sequence/snapshot_latest/std/.
NOTE: Make sure you have the trailing forward slash for Globus to find the right path.
Connect to the BrownU_CCV_Oscar Collection and use this path: /scratch/<username>/<date_taxadiv>/. You can make the <date_taxadiv> folder right in the GUI.
On the left-hand side, use the filter to display all the STD_<taxadiv> DAT files, select all, and start the transfer by hitting the blue Start button. See the image below for an example filter (Note: This image shows files in scratch that have already been transferred):
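If you prefer the command line, the same transfer can be sketched with the Globus CLI. This is an illustration only: the endpoint UUIDs are placeholders you must look up yourself (e.g., with `globus endpoint search`), the PLN filter is an example, and the `--include`/`--exclude` filters assume a reasonably recent globus-cli.

```shell
# Sketch only: fill in the real endpoint UUIDs first, e.g.
#   globus endpoint search "EMBL-EBI"
#   globus endpoint search "BrownU_CCV_Oscar"
SRC_EP="<embl-ebi-endpoint-uuid>"    # placeholder
DST_EP="<oscar-endpoint-uuid>"       # placeholder
SRC_PATH="/pub/databases/ena/sequence/snapshot_latest/std/"  # note the trailing slash
DST_PATH="/scratch/$USER/$(date +%Y%m%d)_PLN/"               # assumes the PLN division

# --include keeps only the STD_PLN DAT files; --exclude drops everything else
CMD="globus transfer $SRC_EP:$SRC_PATH $DST_EP:$DST_PATH --recursive --include 'STD_PLN*' --exclude '*'"
echo "$CMD"   # inspect the command, then run it once the UUIDs are filled in
```

Echoing the command first lets you double-check the paths (including the trailing slash) before starting a 500 GB transfer.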

- If not on campus, make sure you are connected to the Brown VPN
- Navigate to the RStudio Server hosted on Open OnDemand and choose R version 4.3.1.
- Choose 400 hours, 4 cores, 48 GB memory
- Under Modules, put `git miniconda3`.
- Launch the session once it has been allocated.
- Go to the terminal pane in RStudio and `cd /oscar/data/tkartzin/<your folder>` (replace with your user folder).
- In that terminal: `git clone https://github.com/trklab-metabarcoding/obitools2-global-db.git`
- Also in the terminal: `cd obitools2-global-db`
- In the Files pane of RStudio, use the menu at the top right to make sure you are also at the same path.
- Double-click the `obitools2-global-db.Rproj` file to set the project working directory. All of the notebooks are built from this working directory.
Note: This step can take approximately 7-8 days to run, so make sure you choose the correct number of hours. Oscar schedules jobs based on the total number of core-minutes requested, with a cap of 220,000 core-minutes; the request above works out to 4 cores * 400 hours * 60 minutes/hour = 96,000 core-minutes. Asking for fewer cores lets us request more hours to make sure the ecoPCR step can finish.
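The core-minutes arithmetic from the note above can be sanity-checked in the shell before submitting a request (the 220,000 cap is taken from the note; adjust cores/hours to your request):

```shell
CORES=4
HOURS=400
CAP=220000   # Oscar's core-minutes cap, per the note above

CORE_MINUTES=$((CORES * HOURS * 60))
echo "Requesting $CORE_MINUTES core-minutes (cap: $CAP)"
# 4 * 400 * 60 = 96000, comfortably under the cap
[ "$CORE_MINUTES" -le "$CAP" ] && echo "Within Oscar's limit"
```

Trading cores for hours this way is what lets a single long ecoPCR run fit inside the scheduler's limit.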
Once created, global reference databases will be stored in dated folders in the shared lab directory /oscar/data/tkartzin/global_ref_lib under the appropriate taxonomic division code.
Here is a list of the divisions you can choose from when downloading from ENA:
Note: If Conda is not found when running code chunks, add this line to your .bash_profile in your home directory on Oscar: export PATH=/gpfs/runtime/opt/anaconda/2022.05/bin:$PATH
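A small sketch for adding that line without duplicating it on repeated runs. It assumes the Anaconda install lives at the absolute path /gpfs/runtime/opt/anaconda/2022.05/bin; the target file defaults to a temp file here so the sketch runs anywhere, but on Oscar it would be `$HOME/.bash_profile`.

```shell
# On Oscar, set BASHPROFILE="$HOME/.bash_profile"; a temp file is used here for illustration.
BASHPROFILE="${BASHPROFILE:-$(mktemp)}"
LINE='export PATH=/gpfs/runtime/opt/anaconda/2022.05/bin:$PATH'

# Append only if the exact line is not already present (idempotent).
grep -qxF "$LINE" "$BASHPROFILE" || echo "$LINE" >> "$BASHPROFILE"
# Running it again is a no-op:
grep -qxF "$LINE" "$BASHPROFILE" || echo "$LINE" >> "$BASHPROFILE"
```

After editing, start a new shell (or `source` the file) so the updated PATH takes effect in the RStudio terminal.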
The first step is to update all of the params in the YAML header of the first notebook. This includes specifying the taxonomic division, region of interest (e.g., P6), and ecoPCR parameters.
Step through each code chunk to make a new dated folder, retrieve the reference library from ENA, and run an in silico PCR.
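For orientation, a direct ecoPCR invocation looks like the sketch below. The database path is a placeholder, the error/length values are illustrative, and the primer pair shown is the trnL g/h pair commonly used for the P6 loop; the notebook's YAML params are the source of truth.

```shell
# Placeholder: path to the ecoPCR-formatted database built earlier in the notebook.
DB="/scratch/$USER/<date_taxadiv>/ecopcr_db"
FWD="GGGCAATCCTGAGCCAA"        # trnL-g (forward primer)
REV="CCATTGAGTCTCTGCACCTATC"   # trnL-h (reverse primer)

# -d database stem, -e max mismatches per primer, -l/-L min/max amplicon length
CMD="ecoPCR -d $DB -e 3 -l 10 -L 220 $FWD $REV"
echo "$CMD"   # inspect; for a real run, redirect output to a results file, e.g. > P6.ecopcr
```

Keeping the command in a variable makes it easy to log exactly which parameters produced a given reference database, which helps when reporting sequence counts later.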
Note: It will take several hours to download the files from ENA, so this is best treated as a several-day process. We download to the user's scratch directory on Oscar, which has more space and resources to process the files faster. Converting from EMBL format to ecoPCR format also takes several hours, so again, we process on scratch and only transfer the final reference database over to data/tkartzin/<ref db folder>/${DATE}.
As you move through the code, you will be able to see the numbers of sequences included at various stages; these numbers can be used in publications etc.
Note: It will also take several hours to run the ecoPCR simulation. As with the download and conversion steps, we run it on scratch and only transfer the final reference database over to data/tkartzin/<ref db folder>/${DATE}.
At the end of this step, the output will be moved to a dated folder in the appropriate region-of-interest folder under each taxonomic division at /oscar/data/tkartzin/global_ref_lib/<taxonomic division code>/<region-of-interest>.
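The destination layout described above can be sketched as follows. The base directory defaults to a temp dir so the sketch runs anywhere; on Oscar it is /oscar/data/tkartzin/global_ref_lib, and PLN/P6 are example values for the division and region.

```shell
# On Oscar: BASE="/oscar/data/tkartzin/global_ref_lib"; a temp dir is used here for illustration.
BASE="${BASE:-$(mktemp -d)}"
DIV="PLN"                # example taxonomic division code
REGION="P6"              # example region of interest
DATE="$(date +%Y%m%d)"   # dated folder name

DEST="$BASE/$DIV/$REGION/$DATE"
mkdir -p "$DEST"
echo "$DEST"
# Copy the finished reference database files into "$DEST".
```

Dating the folders this way keeps successive builds of the same division/region side by side, so downstream steps can pin a specific database version.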

