A set of scripts to automate the download and tracking of assemblies and annotations.
This script takes an unzipped ncbi datasets folder and a target location. The ncbi folder is copied into the target folder following the target_folder/species/assembly/file structure. Additionally metadata are saved. Currently the scripts only works with *fna and *gff3 files.
- NCBI dataset API
- python and pandas
An example ncbi comand to get all SAR assemblies is:
# dowload a dataset of multiple species
datasets download genome taxon 2698737 --assembly-level chromosome,complete --annotated --reference --include genome,gff3
# or download a single assembly
datasets download genome accession GCF_028858775.2 --include gff3,genome
# unzip archive
unzip ncbi_dataset.zip
# organize the assemblies in the target folder (this is the folder where the species folder will be created)
python organize_ncbi.py --ncbi ncbi_dataset --target target_folder/
This script takes a folder with the structure of species/assemblies/file and it produces a snapshot to use in case someone wants to recreate the same folder structure. The snapshot is saved in the snapshot/ foder and can be tracked using git.
python get_snapshot.py target_folder