-
Notifications
You must be signed in to change notification settings - Fork 6
Apple phased assembly
Here we will describe how to obtain a phased assembly of Malus domestica cv. Fuji with colora.
We are going to use the data from the BioProject PRJNA814760 from the cv Fuji. Here is the relative paper, and here the SRA accessions.
To download the data, we will use sra tools that can be installed with conda/mamba.
mkdir colora_paper
cd colora_paper/
mkdir apple
cd apple
# hifi reads
mkdir hifi
cd hifi
screen -S apple
mamba activate sra-tools
prefetch SRR18311511 --max-size 60GB # here we have to specify 60GB because the standard limit is 20GB
fasterq-dump SRR18311511
# hic reads
cd ..
mkdir hic
cd hic
prefetch SRR18311512 --max-size 50GB
fasterq-dump SRR18311512
cd ~/colora_paper/apple
git clone https://github.com/LiaOb21/colora.git
cd ~/colora_paper/apple/colora
mkdir resources
cd ~/colora_paper/apple/hifi
gzip SRR18311511.fastq
cd ~/colora_paper/apple/hic
gzip SRR18311512_1.fastq
gzip SRR18311512_2.fastq
cd ~/colora_paper/apple/colora/resources
mkdir raw_hifi
cd raw_hifi
ln -s ~/colora_paper/apple/hifi/SRR18311511.fastq.gz
cd ~/colora_paper/apple/colora/resources
mkdir raw_hic
cd raw_hic
ln -s ~/colora_paper/apple/hic/SRR18311512_1.fastq.gz
ln -s ~/colora_paper/apple/hic/SRR18311512_2.fastq.gz
cd ~/software
git clone https://github.com/c-zhou/OatkDB.git
cd ~/colora_paper/apple/colora/resources
mkdir oatkDB
cd oatkDB
ln -s ~/software/OatkDB/v20230921/embryophyta_mito.fam
ln -s ~/software/OatkDB/v20230921/embryophyta_mito.fam.h3f
ln -s ~/software/OatkDB/v20230921/embryophyta_mito.fam.h3i
ln -s ~/software/OatkDB/v20230921/embryophyta_mito.fam.h3m
ln -s ~/software/OatkDB/v20230921/embryophyta_mito.fam.h3p
ln -s ~/software/OatkDB/v20230921/embryophyta_pltd.fam
ln -s ~/software/OatkDB/v20230921/embryophyta_pltd.fam.h3f
ln -s ~/software/OatkDB/v20230921/embryophyta_pltd.fam.h3i
ln -s ~/software/OatkDB/v20230921/embryophyta_pltd.fam.h3m
ln -s ~/software/OatkDB/v20230921/embryophyta_pltd.fam.h3p
We will then set in our config.yaml (N.B., according to the data provided in the paper, the coverage should be around 81x. We should then set c as equal to ~81x5):
oatk:
k: 1001 # kmer size [1001]
c: 405 # minimum kmer coverage [3]
m: "resources/oatkDB/embryophyta_mito.fam" # mitochondria gene annotation HMM profile database [NULL]
optional_params:
"-p": "resources/oatkDB/embryophyta_pltd.fam" # to use for species that have a plastid db
For M. domestica, the closest busco database is Eudicots.
Go to https://busco-data.ezlab.org/v5/data/lineages/ and download the database of interest.
cd ~/colora_paper/apple/colora/resources
mkdir busco_db
cd busco_db
wget https://busco-data.ezlab.org/v5/data/lineages/eudicots_odb10.2024-01-08.tar.gz
tar -xzf eudicots_odb10.2024-01-08.tar.gz
We will then set in our config.yaml:
busco:
lineage: "resources/busco_db/eudicots_odb10" # lineage to be used for busco analysis
This database needs 500GB of memory, so just download it once and symlink it when needed. If you don't have enough memory, you can skip the decontamination step and avoid the download of the database.
Here is how you can easily obtain the database:
mamba create -n ncbi_fcsgx ncbi-fcs-gx
mamba activate ncbi_fcsgx
cd colora/resources
mkdir gxdb
cd gxdb
sync_files.py get --mft https://ftp.ncbi.nlm.nih.gov/genomes/TOOLS/FCS/database/latest/all.manifest --dir ./gxdb
Another thing we need for fcs-gx is the tax id of our species. For M. domestica the ncbi tax id is 3750.
We will then set in our config.yaml:
fcsgx:
ncbi_tax_id: 3750
path_to_gx_db: "resources/gxdb"
This input is optional; you need it only if you want to compare your assembly with the reference for your species. Link to the reference genome: https://www.ncbi.nlm.nih.gov/datasets/genome/GCF_002114115.1/
cd ~/colora_paper/apple/colora/resources
mkdir reference
cd reference
wget https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/002/114/115/GCA_002114115.1_ASM211411v1/GCA_002114115.1_ASM211411v1_genomic.fna.gz
We will then set in our config.yaml:
# Customisable parameters for quast
quast:
optional_params:
"--fragmented": ""
"--large": ""
"-r": "resources/reference/GCA_002114115.1_ASM211411v1_genomic.fna.gz" #reference genome (fasta)
This is how our config.yaml should look like at the end (see colora/config/README.md for further details):
# config.yaml for real data
# Set memory and threads for high demanding rules
high:
mem_mb: 307200 # memory in MB
t: 50 # number of threads
# Set memory and threads for medium demanding rules
medium:
mem_mb: 20480 # memory in MB
t: 20 # number of threads
# Set memory and threads for low demanding rules
low:
mem_mb: 10240 # memory in MB
t: 8 # number of threads
# Path to hifi reads
hifi_path: "resources/raw_hifi/"
# Path to hic reads
hic_path: "resources/raw_hic/"
# Customisable parameters for kmc
kmc:
k: 21 # kmer size, it will be the same used for genomescope2
ci: 1 # exclude k-mers occurring less than <value> times (default: 2)
cs: 1000000 #maximal value of a counter (default: 255)
# Customisable parameters for kmc_tools transform
kmc_tools:
cx: 1000000 # exclude k-mers occurring more of than <value> times
# Customisable parameters for genomescope2
genomescope2:
optional_params:
"-p": "2"
"-l": ""
# Customisable parameters for oatk
oatk:
k: 1001 # kmer size [1001]
c: 405 # minimum kmer coverage [3]
m: "resources/oatkDB/embryophyta_mito.fam" # mitochondria gene annotation HMM profile database [NULL]
optional_params:
"-p": "resources/oatkDB/embryophyta_pltd.fam" # to use for species that have a plastid db
# Customisable parameters for fastp
fastp:
optional_params:
"--cut_front": False # set to True for Arima Hi-C library prep kit generated data
"--cut_front_window_size": "" # set to 5 for Arima Hi-C library prep kit generated data
# Customisable parameters for hifiasm
hifiasm:
phased_assembly: True # set to true if you want to obtain a phased assembly
optional_params:
"-f": "" # used for small datasets
"-l": "" # purge level. 0: no purging; 1: light; 2/3: aggressive [0 for trio; 3 for unzip]
"--ul": "" # use this if you have also ont data you want to integrate in your assembly
#Set this to False if you want to skip the fcsgx step:
include_fcsgx: True #inlcude this rule only if you have previously downloaded the database (recommended to run fcsgx only on a HPC. It requires around 500 GB of space on your disk and a large RAM)
# Customisable parameters for fcsgx
fcsgx:
ncbi_tax_id: 3750
path_to_gx_db: "resources/gxdb"
# Set this to False if you want to skip purge_dups steps:
include_purge_dups: False
# Customisable parameters for arima mapping pipeline:
arima:
MAPQ_FILTER: 10
# Customisable parameters for yahs
yahs:
optional_params:
"-e": "" # you can specify the restriction enzyme(s) used by the Hi-C experiment
# Customisable parameters for quast
quast:
optional_params:
"--fragmented": ""
"--large": ""
"-r": "resources/reference/GCA_002114115.1_ASM211411v1_genomic.fna.gz" #reference genome (fasta)
# Customisable parameters for busco
busco:
lineage: "resources/busco_db/eudicots_odb10" # lineage to be used for busco analysis
N.B.: we set phased_assembly: True because we are dealing with a heterozygous species and we want two separate assemblies for the two haplotypes. In this case, it's mandatory to set include_purge_dups: False.
At this point, we should be able to run the pipeline with a really simple command:
cd ~/colora_paper/apple/colora/
screen -r apple # if you want to run it in the background
mamba activate snakemake
snakemake --software-deployment-method conda
N.B. if you are using a server where jobs are normally submitted through SLURM or other schedulers, you might consider setting up a snakemake profile in your system to handle job submission.