Quality Control Workflow Pangenomes

Introduction

License

All code in this repository is released under the MIT license.

Context

The code in this repository was created within the context of the EPAN ELIXIR BFSP project.
ELIXIR Europe (https://elixir-europe.org/) is an intergovernmental organisation that brings together life science resources from across Europe.
Within the ELIXIR organisation, there are multiple science tiers, one of which is BFSP: Biodiversity, Food Security, and Pathogens (https://elixir-europe.org/how-we-work/scientific-programme/science/bfsp).
EPAN (Enhancing Pan-genome Analysis in Plants) is a project funded by ELIXIR (https://elixir-europe.org/how-we-work/scientific-programme/science/bfsp/e-pan).

Overview

One of the key issues when dealing with genome data is divergent quality between different genome entries.
This issue can be compounded by differences in the sequencing technology used, in the algorithms used for assembly and scaffolding, and in the programs used for gene annotation.
Within the project we are mostly focusing on how gene-space behaves within the context of pangenomics.
Therefore, this Quality Control (QC) workflow will only deal with gene quality assessments.
However, the workflow can easily be extended to deal with sequence quality assessments as well.

Dependencies

The workflow is based on Nextflow (https://www.nextflow.io/), which is the primary dependency. Nextflow itself depends on Java; please check the Nextflow installation options.
In addition, the workflow makes heavy use of other existing bioinformatics tools for gene quality assessment, such as BUSCO and OMARK.
These bioinformatics tools are loaded automatically as Singularity containers, so they do not need to be installed directly by the user.
For more information about these tools:

BUSCO : https://busco.ezlab.org/
OMARK : https://github.com/DessimozLab/OMArk

Usage

General

A typical invocation of the QC workflow looks like this:

nextflow run code/epan_qc.nf --pan ../settings/oryza_sativa.json --qc ../settings/qc.json --outdir ../output/

The workflow accepts four parameters:

--pan PANGENOME-CONFIG-FILE : Required. JSON configuration file that describes the pangenome. The content of this file is described in the "Pangenome configuration file" section.
--qc QC-CONFIG-FILE : Required. JSON configuration file that defines which steps the QC analysis runs. The content of this file is described in the "QC configuration file" section.
--outdir OUTPUT-DIR : Required. Path to the directory where the output files will be placed.
--tmp TMP-DIR : Optional. Path to the directory where temporary data files will be placed. This includes the reference databases downloaded by BUSCO/OMARK. If not provided, it defaults to a ./tmp/ subdirectory in the output directory.

Note: If relative paths (such as ./output/) do not work, try using absolute paths. Nextflow can be quite picky when trying to interpret them.
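
Since relative paths may not be interpreted reliably, one option is to resolve them to absolute paths before calling Nextflow. The following Python sketch does that; the helper name build_command is hypothetical and not part of this repository, and the flags are the ones listed above.

```python
from pathlib import Path


def build_command(pan, qc, outdir, tmp=None, script="code/epan_qc.nf"):
    """Build the nextflow argument list, resolving every path to an absolute one.

    Hypothetical helper for illustration; only the flags come from this README.
    """
    cmd = [
        "nextflow", "run", script,
        "--pan", str(Path(pan).resolve()),
        "--qc", str(Path(qc).resolve()),
        "--outdir", str(Path(outdir).resolve()),
    ]
    # --tmp is optional; when omitted the workflow falls back to its default.
    if tmp is not None:
        cmd += ["--tmp", str(Path(tmp).resolve())]
    return cmd


# Print the command line; it could also be handed to subprocess.run(cmd, check=True).
print(" ".join(build_command("../settings/oryza_sativa.json",
                             "../settings/qc.json",
                             "../output/")))
```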

Pangenome configuration file

Overview

The pangenome configuration file contains all data related to the genomes that make up a pangenome. This example is a pangenome for Oryza sativa, with 2 constituent genomes:

{
    "storage": {
        "path": "/data/genomes/oryza_sativa/"
    },
    "pangenome": {
        "species": {
            "name": "oryza sativa",
            "tax_id": "4530"
        },
        "genomes": [
            {
                "name": "02428",
                "directory": "./02428/v1.0/",
                "cds": "02428.IGDBv1.Allset.cds.fasta",
                "gff": "02428.IGDBv1.Allset.gff",
                "proteins": "02428.IGDBv1.Allset.pros.fasta",
                "genome": "02428.genome"
            },
            {
                "name": "9311",
                "directory": "./9311/v1.0/",
                "cds": "9311.IGDBv1.Allset.cds.fasta",
                "gff": "9311.IGDBv1.Allset.gff",
                "proteins": "9311.IGDBv1.Allset.pros.fasta",
                "genome": "9311.genome"
            }
        ]
    }
}

The storage.path entry points to the path where the data for the genomes is stored.
The pangenome entry contains information about the pangenome, with pangenome.species containing global information, and pangenome.genomes being a list of the genomes that make up the pangenome.
This list consists of simple object entries, each containing information about a single genome. These object entries consist of the following fields:

name : The name of the genome entry, likely the name of a cultivar/ecotype.
directory : Location, relative to storage.path, where the data for the genome entry is located.
cds : File name of the CDS FASTA file for the genome entry.
gff : File name of the GFF file for the genome entry.
proteins : File name of the protein FASTA file for the genome entry.
genome : File name of the genome FASTA file for the genome entry.

Note that not all of the data content is required to be present, depending on the type(s) of QC being executed.
For example: BUSCO and OMARK only require the protein file as input. As such, each genome entry should contain either the proteins key, or the cds key (the CDS will in that case be automatically translated).
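
To illustrate what translating a CDS into a protein sequence involves, here is a minimal Python sketch using the standard genetic code. It is not the translation step used by the workflow (which tool the workflow uses is not described here); the choices to stop at the first stop codon and to emit X for ambiguous codons are assumptions made for this example.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order (TTT, TTC, TTA, TTG, TCT, ...).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(cds):
    """Translate a CDS (frame 0) into a protein string.

    Assumptions for this sketch: translation stops at the first stop codon,
    unknown or ambiguous codons become 'X', and trailing bases that do not
    form a full codon are ignored.
    """
    cds = cds.upper().replace("U", "T")
    protein = []
    for i in range(0, len(cds) - len(cds) % 3, 3):
        aa = CODON_TABLE.get(cds[i:i + 3], "X")
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)


print(translate("ATGGCCTAA"))
```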

Additionally, it is important to note that the final paths are constructed based on the combination of storage.path, the directory of the genome entry, and the file-name of the data entry.
For example: the CDS file of cultivar 9311 is located at /data/genomes/oryza_sativa/9311/v1.0/9311.IGDBv1.Allset.cds.fasta
One of the initial steps in the workflow is a safety check to see whether the derived file paths exist on the system where the workflow is executed. If any of these checks fail, the workflow will be halted until the error is corrected.
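
The path construction and existence check described above can be sketched in Python as follows. This is an illustration of the rule (storage.path + directory + file name), not the workflow's own implementation; the function names resolve_paths and missing_files are hypothetical.

```python
import os

# Data keys a genome entry may carry, as listed in this README.
DATA_KEYS = ("cds", "gff", "proteins", "genome")


def resolve_paths(config):
    """Return {genome name: {data key: full path}} for a parsed pangenome config."""
    base = config["storage"]["path"]
    resolved = {}
    for genome in config["pangenome"]["genomes"]:
        files = {}
        for key in DATA_KEYS:
            if key in genome:  # not every key has to be present
                full = os.path.join(base, genome["directory"], genome[key])
                files[key] = os.path.normpath(full)  # drops the "./" segment
        resolved[genome["name"]] = files
    return resolved


def missing_files(resolved):
    """List every derived path that does not exist on this system."""
    return [
        path
        for files in resolved.values()
        for path in files.values()
        if not os.path.isfile(path)
    ]


config = {
    "storage": {"path": "/data/genomes/oryza_sativa/"},
    "pangenome": {"genomes": [
        {"name": "9311", "directory": "./9311/v1.0/",
         "cds": "9311.IGDBv1.Allset.cds.fasta"},
    ]},
}
print(resolve_paths(config)["9311"]["cds"])
```

In practice the config dictionary would come from json.load on the file passed to --pan, and a non-empty result from missing_files would be the reason to stop before any QC step runs.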

Creating the configuration file

As creating this file by hand can be quite tedious for large pangenomes, the codebase includes a PHP utility script that can help: ./util/generate_pangenome_json.php.
This script can automatically generate the required content of the JSON configuration file from the content of a directory, provided the pangenome data is stored in a consistent manner.
That is, it assumes that each genome entry of the pangenome is in its own subdirectory, that the CDS FASTA files share a common suffix, etc.

The script has the following options:

--input INPUT-DIRECTORY : Required. Main location of the pangenome data. Corresponds to the storage.path variable set in the JSON configuration file. All information is derived solely by scanning the file names in this directory.
--output JSON-CONFIG-FILE : Required. Path where the JSON configuration file of the pangenome will be written.
--species SPECIES : Required. Name of the species of the pangenome.
--taxid TAXID : Required. Taxonomy identifier of the pangenome species (see https://www.ncbi.nlm.nih.gov/taxonomy for more information).
--subdir SUBDIRNAME : Optional. Fixed subdirectory in which the data for each genome is located. E.g. An1/V1/An1.cds.fasta --> V1 is a fixed subdirectory consistently present for all genome entries.
--cds CDS-SUFFIX : Optional. Suffix used to recognize the CDS FASTA files.
--proteins PROT-SUFFIX : Optional. Suffix used to recognize the protein FASTA files.
--gff GFF-SUFFIX : Optional. Suffix used to recognize the GFF files.
--genome GENOME-SUFFIX : Optional. Suffix used to recognize the genome FASTA files.
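
The idea behind the suffix-based scan can be sketched in Python as follows. This is not a port of the PHP script, whose internals are not shown here; the function name scan_pangenome and the rule of taking the first matching file per suffix are assumptions made for this example.

```python
import os


def scan_pangenome(input_dir, species, tax_id, suffixes, subdir=""):
    """Build a pangenome config dict by scanning input_dir.

    suffixes maps a data key (cds, proteins, gff, genome) to a file-name suffix.
    Assumption for this sketch: each subdirectory of input_dir is one genome
    entry, and the first file (alphabetically) matching a suffix is used.
    """
    genomes = []
    for name in sorted(os.listdir(input_dir)):
        data_dir = os.path.join(input_dir, name, subdir)
        if not os.path.isdir(data_dir):
            continue  # skip plain files and genomes lacking the fixed subdirectory
        relative = os.path.join(name, subdir).rstrip("/")
        entry = {"name": name, "directory": "./" + relative + "/"}
        for key, suffix in suffixes.items():
            matches = sorted(f for f in os.listdir(data_dir) if f.endswith(suffix))
            if matches:
                entry[key] = matches[0]
        genomes.append(entry)
    return {
        "storage": {"path": input_dir},
        "pangenome": {
            "species": {"name": species, "tax_id": tax_id},
            "genomes": genomes,
        },
    }
```

Calling scan_pangenome("/data/genomes/oryza_sativa/", "oryza sativa", "4530", {"cds": ".cds.fasta", "proteins": ".pros.fasta"}, subdir="v1.0") and passing the result to json.dump would produce a file with the layout shown in the previous section. Suffixes need to be specific enough to be unambiguous: a bare ".fasta" would also match the CDS and protein files.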

QC configuration file

Overview

Rather than having to provide a multitude of options when running the workflow in order to enable/disable certain settings, this workflow uses a JSON configuration file for the QC settings as well.
This is an example QC configuration file:

{
	"qc":{
		"busco":{
			"enabled":true,
			"settings":{
				"database":"viridiplantae_odb10"
			}
		},
		"omark":{
			"enabled":true,
			"settings":{
				"url":"https://omabrowser.org/All/",
				"database":"Viridiplantae"
			}
		}
	},
	"phylogeny":{
		"enabled":true,
		"data":"omark",
		"diamond":"--faster --evalue 0.005 --quiet"
	}
}

Because the settings live in a configuration file, it can easily be expanded to include other workflow tasks, such as phylogeny construction of the pangenome.
Currently, the configuration file contains the following top-level entries, one per subworkflow:

qc : Settings for QC of the pangenome. Each tool (busco, omark) has an enabled flag and a settings object with tool-specific options, such as the reference database.
phylogeny : Settings for creating the phylogenetic tree for the genomes in the pangenome.
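
As a quick way to check which steps a given QC configuration switches on, the file can be parsed and its enabled flags inspected. The Python sketch below assumes the layout of the example above; the function name enabled_steps is hypothetical. Strict JSON parsers such as Python's json module reject trailing commas, so the file has to be valid JSON.

```python
import json

EXAMPLE = """
{
  "qc": {
    "busco": {"enabled": true, "settings": {"database": "viridiplantae_odb10"}},
    "omark": {"enabled": false, "settings": {"database": "Viridiplantae"}}
  },
  "phylogeny": {"enabled": true, "data": "omark"}
}
"""


def enabled_steps(qc_config):
    """Return the names of all steps whose "enabled" flag is true."""
    steps = [
        f"qc.{tool}"
        for tool, options in qc_config.get("qc", {}).items()
        if options.get("enabled")
    ]
    if qc_config.get("phylogeny", {}).get("enabled"):
        steps.append("phylogeny")
    return steps


print(enabled_steps(json.loads(EXAMPLE)))
```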

Creating the configuration file

The easiest way is to make a copy of the default JSON file, adapt it where necessary, and provide it to the workflow.

Results

The workflow will output a variety of files in the designated output directory, with the file-names based on the species defined in the input JSON file.
This output will consist of QC files (in both JSON and tab-delimited format), and optionally a constructed phylogenetic tree.
For example, for input species Oryza sativa and output-directory ./output/, the resulting QC files will be:

./output/qc_results.oryza_sativa.tsv : The tab-delimited output file of the QC results
./output/qc_results.oryza_sativa.json : The JSON output file of the QC results

TSV output

## Genomes QC: oryza sativa
#Name        BUSCO:present   BUSCO:missing      OMARK:present   OMARK:missing
9311         420             5                  -1              -1
02428        419             6                  -1              -1
Basmati1     410             15                 -1              -1


## Aggregated QC results: oryza sativa
QC-type Average Median  Minimum Maximum
BUSCO   416     419     410     420
OMARK   TBD     TBD     TBD     TBD

JSON output

{
    "genomes":{
        "9311":{
            "BUSCO":{"present":420,"missing":5},
            "OMARK":{"present":-1,"missing":-1}
        },
        "02428":{
            "BUSCO":{"present":419,"missing":6},
            "OMARK":{"present":-1,"missing":-1}
        },
        "Basmati1":{
            "BUSCO":{"present":410,"missing":15},
            "OMARK":{"present":-1,"missing":-1}
        }
    },
    "aggregated":{
        "BUSCO":{"average":416,"median":419,"minimum":410,"maximum":420},
        "OMARK":{"average":-1,"median":-1,"minimum":-1,"maximum":-1}
    }
}
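
The aggregated values can be recomputed from the per-genome entries, which is a convenient sanity check when post-processing the JSON output. The Python sketch below rests on two assumptions that are inferred from the example rather than documented: a value of -1 means "not available", and the average is rounded to the nearest integer.

```python
from statistics import mean, median


def aggregate(genomes, qc_type, field="present"):
    """Aggregate one QC field over all genomes.

    Assumptions for this sketch: negative values mark missing results and are
    skipped, and the average is rounded to the nearest integer.
    """
    values = [
        entry[qc_type][field]
        for entry in genomes.values()
        if entry[qc_type][field] >= 0
    ]
    if not values:
        return {"average": -1, "median": -1, "minimum": -1, "maximum": -1}
    return {
        "average": round(mean(values)),
        "median": median(values),
        "minimum": min(values),
        "maximum": max(values),
    }


genomes = {
    "9311": {"BUSCO": {"present": 420}, "OMARK": {"present": -1}},
    "02428": {"BUSCO": {"present": 419}, "OMARK": {"present": -1}},
    "Basmati1": {"BUSCO": {"present": 410}, "OMARK": {"present": -1}},
}
print(aggregate(genomes, "BUSCO"))
print(aggregate(genomes, "OMARK"))
```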

Troubleshooting

Currently, Nextflow is set up to use Singularity containers to run external programs (e.g. BUSCO, OMARK, etc.).
The configuration should in theory allow for seamless automatic downloading of the container images (putting them in work/singularity/ for future re-use), and for running the software in the containers with the data provided.
However, one issue encountered on network shares is that Nextflow does not always make all network share mounts available to these Singularity containers.
This will be quite obvious from the error message produced:

  2026-02-23 17:27:55 INFO::    Input file is /biocomp/data/genomes/oryza_sativa/02428/v1.0/02428.IGDBv1.Allset.pros.top10.fasta
  2026-02-23 17:27:55 ERROR:    /biocomp/data/genomes/oryza_sativa/02428/v1.0/02428.IGDBv1.Allset.pros.top10.fasta does not exist

This error can be circumvented by editing the nextflow.config file and adding a runOptions = "--bind PATH" entry to the singularity block. The default Nextflow configuration file looks like this:

singularity {
	enabled = true
}

The altered Nextflow configuration file can look like this:

singularity {
	enabled = true
	runOptions = "--bind /biocomp/data/genomes/oryza_sativa/"
}

This ensures that Nextflow mounts the /biocomp/data/genomes/oryza_sativa/ network location for the Singularity containers, making the data available.
Note that there can be multiple --bind values set within the same runOptions setting:

singularity {
	enabled = true
	runOptions = "--bind /biocomp/data/genomes/oryza_sativa/ --bind /other/data/location/ --bind /third/data/location/"
}
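
When several pangenomes are stored in different locations, the runOptions value can be assembled from the storage.path entries of the pangenome configuration files. The Python sketch below only builds the string to paste into nextflow.config; the helper name bind_options is hypothetical.

```python
def bind_options(paths):
    """Return a runOptions value with one --bind per unique path, in input order."""
    unique = []
    for path in paths:
        if path not in unique:  # avoid binding the same location twice
            unique.append(path)
    return " ".join(f"--bind {path}" for path in unique)


print(bind_options([
    "/biocomp/data/genomes/oryza_sativa/",
    "/other/data/location/",
    "/biocomp/data/genomes/oryza_sativa/",
]))
```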
