Global database creation using OBITools v.1.2.12 on Brown's high-performance cluster, OSCAR.
This repository can be used to download a global reference library from the European Nucleotide Archive (ENA) (i.e., all available sequences in a given division).
Note: This step only needs to be run if an updated version of a global reference library is required for downstream steps.
The schematic below shows the entire bioinformatic pipeline for DNA metabarcoding data, but the step included in this repository is shown in the white box.
Post-processing steps use the Python 2.7 version of OBITools to trim the database down to sequences matching the lab's gene/primer use cases.
The European Nucleotide Archive (ENA) hosts a Globus endpoint for file transfers. Globus is far faster than other options and is the best way to download the DAT files for your taxonomic division(s) of interest to your scratch directory. In the example for plants (PLN), all of the available sequences amount to over 500 GB, which need to be stored in scratch.
See the Oscar docs to set up your account with Globus.
In the GUI, choose the EMBL-EBI Public Data Collection and use this path: /pub/databases/ena/sequence/snapshot_latest/std/.
NOTE: Make sure you have the trailing forward slash for Globus to find the right path.
Connect to the BrownU_CCV_Oscar Collection and use this path: /scratch/<username>/<date_taxadiv>/. You can make the <date_taxadiv> folder right in the GUI.
On the left-hand side, use the filter to display all the STD_<taxadiv> DAT files, select all, and start the transfer by hitting the blue Start button. See the image below for an example filter (Note: This image shows files in scratch that have already been transferred):
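If you prefer the command line, the same transfer can be sketched with the Globus CLI. This is an illustration only: the endpoint UUIDs are placeholders you must look up yourself (e.g., with `globus endpoint search`), the PLN filter is an example, and the `--include`/`--exclude` filters assume a reasonably recent globus-cli.

```shell
# Sketch only: fill in the real endpoint UUIDs first, e.g.
#   globus endpoint search "EMBL-EBI"
#   globus endpoint search "BrownU_CCV_Oscar"
SRC_EP="<embl-ebi-endpoint-uuid>"    # placeholder
DST_EP="<oscar-endpoint-uuid>"       # placeholder
SRC_PATH="/pub/databases/ena/sequence/snapshot_latest/std/"  # note the trailing slash
DST_PATH="/scratch/$USER/$(date +%Y%m%d)_PLN/"               # assumes the PLN division

# --include keeps only the STD_PLN DAT files; --exclude drops everything else
CMD="globus transfer $SRC_EP:$SRC_PATH $DST_EP:$DST_PATH --recursive --include 'STD_PLN*' --exclude '*'"
echo "$CMD"   # inspect the command, then run it once the UUIDs are filled in
```

Echoing the command first lets you double-check the paths (including the trailing slash) before starting a 500 GB transfer.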

- If not on campus, make sure you are connected to the Brown VPN
- Navigate to the RStudio Server hosted on Open OnDemand and choose R version 4.3.1.
- Choose 400 hours, 4 cores, 48 GB memory
- Under Modules, put `git miniconda3`.
- Launch the session once it has been allocated.
- Go to the terminal pane in RStudio and `cd /oscar/data/tkartzin/<your folder>` (replace with your user folder).
- In that terminal: `git clone https://github.com/trklab-metabarcoding/obitools2-global-db.git`
- Also in the terminal: `cd obitools2-global-db`
- In the Files pane of RStudio, use the menu at the top right to make sure you are also at the same path.
- Double-click the `obitools2-global-db.Rproj` file to set the project working directory. All of the notebooks are built from this working directory.
Note: This step can take approximately 7-8 days to run, so make sure you choose the correct number of hours. Oscar schedules jobs based on the total number of core-minutes requested, with a cap of 220,000 core-minutes; the request above works out to 4 cores * 400 hours * 60 minutes/hour = 96,000 core-minutes. Asking for fewer cores lets us request more hours to make sure the ecoPCR step can finish.
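The core-minutes arithmetic from the note above can be sanity-checked in the shell before submitting a request (the 220,000 cap is taken from the note; adjust cores/hours to your request):

```shell
CORES=4
HOURS=400
CAP=220000   # Oscar's core-minutes cap, per the note above

CORE_MINUTES=$((CORES * HOURS * 60))
echo "Requesting $CORE_MINUTES core-minutes (cap: $CAP)"
# 4 * 400 * 60 = 96000, comfortably under the cap
[ "$CORE_MINUTES" -le "$CAP" ] && echo "Within Oscar's limit"
```

Trading cores for hours this way is what lets a single long ecoPCR run fit inside the scheduler's limit.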
Once created, global reference databases will be stored in dated folders in the shared lab directory /oscar/data/tkartzin/global_ref_lib under the appropriate taxonomic division code.
Here is a list of the divisions you can choose from when downloading from ENA:
Note: If Conda is not found when running code chunks, add this line to your .bash_profile in your home directory on Oscar: export PATH=/gpfs/runtime/opt/anaconda/2022.05/bin:$PATH
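A small sketch for adding that line without duplicating it on repeated runs. It assumes the Anaconda install lives at the absolute path /gpfs/runtime/opt/anaconda/2022.05/bin; the target file defaults to a temp file here so the sketch runs anywhere, but on Oscar it would be `$HOME/.bash_profile`.

```shell
# On Oscar, set BASHPROFILE="$HOME/.bash_profile"; a temp file is used here for illustration.
BASHPROFILE="${BASHPROFILE:-$(mktemp)}"
LINE='export PATH=/gpfs/runtime/opt/anaconda/2022.05/bin:$PATH'

# Append only if the exact line is not already present (idempotent).
grep -qxF "$LINE" "$BASHPROFILE" || echo "$LINE" >> "$BASHPROFILE"
# Running it again is a no-op:
grep -qxF "$LINE" "$BASHPROFILE" || echo "$LINE" >> "$BASHPROFILE"
```

After editing, start a new shell (or `source` the file) so the updated PATH takes effect in the RStudio terminal.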
The first step is to update all of the params in the YAML header of the first notebook. This includes specifying the taxonomic division, region of interest (e.g., P6), and ecoPCR parameters.
Step through each code chunk to make a new dated folder, retrieve the reference library from ENA, and run an in silico PCR.
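For orientation, a direct ecoPCR invocation looks like the sketch below. The database path is a placeholder, the error/length values are illustrative, and the primer pair shown is the trnL g/h pair commonly used for the P6 loop; the notebook's YAML params are the source of truth.

```shell
# Placeholder: path to the ecoPCR-formatted database built earlier in the notebook.
DB="/scratch/$USER/<date_taxadiv>/ecopcr_db"
FWD="GGGCAATCCTGAGCCAA"        # trnL-g (forward primer)
REV="CCATTGAGTCTCTGCACCTATC"   # trnL-h (reverse primer)

# -d database stem, -e max mismatches per primer, -l/-L min/max amplicon length
CMD="ecoPCR -d $DB -e 3 -l 10 -L 220 $FWD $REV"
echo "$CMD"   # inspect; for a real run, redirect output to a results file, e.g. > P6.ecopcr
```

Keeping the command in a variable makes it easy to log exactly which parameters produced a given reference database, which helps when reporting sequence counts later.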
Note: It will take several hours to download the files from ENA, so this is best treated as a several-day process. We download to the user's scratch directory on Oscar, which has more space and resources to process the files faster. Converting from EMBL format to ecoPCR format also takes several hours, so again, we process on scratch and only transfer the final reference database over to data/tkartzin/<ref db folder>/${DATE}.
As you move through the code, you will be able to see the numbers of sequences included at various stages; these numbers can be used in publications etc.
Note: It will also take several hours to run the ecoPCR simulation. As with the download and conversion steps, we run it on scratch and only transfer the final reference database over to data/tkartzin/<ref db folder>/${DATE}.
At the end of this step, the output will be moved to a dated folder in the appropriate region-of-interest folder under each taxonomic division at /oscar/data/tkartzin/global_ref_lib/<taxonomic division code>/<region-of-interest>.
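The destination layout described above can be sketched as follows. The base directory defaults to a temp dir so the sketch runs anywhere; on Oscar it is /oscar/data/tkartzin/global_ref_lib, and PLN/P6 are example values for the division and region.

```shell
# On Oscar: BASE="/oscar/data/tkartzin/global_ref_lib"; a temp dir is used here for illustration.
BASE="${BASE:-$(mktemp -d)}"
DIV="PLN"                # example taxonomic division code
REGION="P6"              # example region of interest
DATE="$(date +%Y%m%d)"   # dated folder name

DEST="$BASE/$DIV/$REGION/$DATE"
mkdir -p "$DEST"
echo "$DEST"
# Copy the finished reference database files into "$DEST".
```

Dating the folders this way keeps successive builds of the same division/region side by side, so downstream steps can pin a specific database version.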

