Data preprocessing for a GNExT platform is performed through a Nextflow-based pipeline that enables seamless deployment across different computing environments and automatically ensures scalability for large collections of GWAS summary statistics.
The pipeline provides comprehensive functionality for preparing GWAS summary data for integration into the GNExT platform, including data harmonization, variant annotation using the Ensembl Variant Effect Predictor, and, optionally, the execution of gene-based association analyses with the state-of-the-art tool MAGMA.
To start, clone the repository:
```shell
git clone https://github.com/DyHealthNet/gnext_nf_pipeline.git
```
Running the pipeline requires Java and Nextflow. Depending on the compute environment you select, one of Conda, Docker, or Singularity must also be installed.
Details on how to install Java and Nextflow can be found here: https://www.nextflow.io/docs/latest/install.html.
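Before launching the pipeline, it can help to confirm that the required tools are on the `PATH`. The following is a minimal sketch: the helper name `missing_tools` is hypothetical and not part of the pipeline, and `docker` is only an example that should be swapped for the environment manager you actually use.

```shell
# Hypothetical helper (not part of the pipeline): print every tool from the
# argument list that cannot be found on the PATH. Prints nothing if all exist.
missing_tools() {
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
  done
}

# Java and Nextflow are always required; replace "docker" with "conda" or
# "singularity" depending on the profile you plan to use.
missing=$(missing_tools java nextflow docker)
if [ -n "$missing" ]; then
  echo "Missing tools:" $missing
else
  echo "All required tools were found."
fi
```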
For running MAGMA, appropriate reference data are required. The 1000 Genomes Project is commonly employed as the reference population.
Reference data are deposited on Zenodo (https://doi.org/10.5281/zenodo.17940903), including PLINK files for GRCh37 and GRCh38 across five super populations as well as Ensembl protein-coding gene annotation files for GRCh37 and GRCh38 from Ensembl releases 114 and 115.
After downloading, the corresponding paths to the reference files should be updated in the nextflow.config file.
Note: For alternative Ensembl versions of the gene annotation files, or for details on how the reference data were generated, please consult the GitHub repository at https://github.com/DyHealthNet/gnext_reference_data.git.
The primary input to the preprocessing pipeline is a CSV/TSV file containing the phenotypes and their corresponding GWAS summary statistics file locations.
Each row represents one phenotype, with the following columns:
| Column | Description |
|---|---|
| phenocode | Unique identifier of the phenotype or trait analyzed in the corresponding GWAS summary statistics file. |
| description | Description of the phenotype or trait analyzed in the corresponding GWAS summary statistics file. |
| category | Group/category of the phenotype or trait analyzed in the corresponding GWAS summary statistics file. |
| filename | Absolute path to the GWAS summary statistics file associated with the phenotype. |
| nr_samples | OPTIONAL: Number of samples included in the GWAS for the respective phenotype, used for downstream analyses such as MAGMA gene-based testing. Can also be included in the GWAS files. |
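As an illustration of the phenotype file layout, the sketch below writes a small TSV and runs two basic sanity checks on it. It assumes a tab-separated file with a header row and the columns in the order listed in the table; the phenocodes, descriptions, and paths are placeholders, and the helper `check_pheno_file` is hypothetical rather than part of the pipeline.

```shell
# Write a small, hypothetical phenotype file (TSV). All values are placeholders;
# the header row and the column order are assumptions based on the table above.
pheno_file=$(mktemp)
{
  printf 'phenocode\tdescription\tcategory\tfilename\tnr_samples\n'
  printf 'trait_a\tExample trait A\tExample category\t/data/gwas/trait_a.tsv\t10000\n'
  printf 'trait_b\tExample trait B\tExample category\t/data/gwas/trait_b.tsv\t12000\n'
} > "$pheno_file"

# Hypothetical sanity check: phenocodes must be unique and filenames absolute.
# Returns a non-zero exit status if any row violates these rules.
check_pheno_file() {
  awk -F '\t' '
    NR == 1 { next }                       # skip the assumed header row
    seen[$1]++ { print "duplicate phenocode: " $1; bad = 1 }
    $4 !~ /^\// { print "not an absolute path: " $4; bad = 1 }
    END { exit bad + 0 }
  ' "$1"
}

if check_pheno_file "$pheno_file"; then
  echo "Phenotype file passed the basic checks."
fi
```

The check only covers the two properties the table states explicitly (unique identifiers and absolute paths); it does not verify that the referenced files exist.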
The column positions for the required fields (chrom, pos, ref, alt, p-value, beta, se, af) can be defined within the study-specific configuration.
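To find the column positions for a given summary statistics file, one option is to number the fields of its header line. The sketch below assumes a tab-separated file with a header row; the helper `list_columns` and the demo header are made up for illustration.

```shell
# Hypothetical helper: print "index<TAB>name" for each column in the header
# line of a tab-separated file, using 1-based indices.
list_columns() {
  head -n 1 "$1" | tr '\t' '\n' | awk '{ printf "%d\t%s\n", NR, $0 }'
}

# Demonstration on a made-up header; real files will have different columns.
demo=$(mktemp)
printf 'variant_id\trsid\tchrom\tpos\tref\talt\n' > "$demo"
list_columns "$demo"
```

In this made-up header, `chrom` is listed with index 3, so the chromosome column index would be set to 3. For gzip-compressed files, the header can be read with `gzip -dc file | head -n 1` instead.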
Study-specific parameters must be defined in the configuration file configs/study_specific.config.
Computational settings such as memory allocation, runtime limits, and other scheduler-related parameters should be specified in configs/compute_slurm.config.
In addition, a general configuration file configs/base.config is provided to adjust parameters related to Manhattan and top hits generation as well as VEP window distances. These parameters are preconfigured with suitable defaults and typically do not require manual modification.
| Parameter | Example Value | Description |
|---|---|---|
| pheno_file | NA | Path to the phenotype configuration file containing trait names and corresponding GWAS summary statistic file paths. |
| base_dir | NA | Path to the base directory containing the GWAS summary statistic files (for docker and singularity volume). |
| out_dir | NA | Output directory where all processed results and intermediate files will be stored. |
| pheno_batch_size | 5 | Number of phenotypes processed in parallel within a single batch. |
| window_up | 10 | Upstream window size (in kb) used when mapping variants to genes in MAGMA. |
| window_down | 10 | Downstream window size (in kb) used when mapping variants to genes in MAGMA. |
| gene_location | NA | Path to the Ensembl gene location file used for MAGMA analyses. |
| steps | ["gene_statistics", "gwas_exploration"] | List of workflow steps to be executed (e.g., gene-based statistics, exploratory analysis). |
| chr_column | 3 | Column index (1-based) of the chromosome field in the GWAS summary statistics file. |
| pos_column | 4 | Column index of the variant position field. |
| ref_column | 5 | Column index of the reference allele field. |
| alt_column | 6 | Column index of the alternate allele field. |
| pval_column | 16 | Column index of the p-value field. |
| beta_column | 9 | Column index of the effect size (beta) field. |
| se_column | 10 | Column index of the standard error field. |
| af_column | 8 | Column index of the alternate allele frequency field. |
| pval_neglog10 | false | Indicates whether p-values are stored as negative log10 values (true) or raw p-values (false). |
| ensemblvep_species | 'homo_sapiens' | Species identifier for Ensembl VEP annotation. |
| ensemblvep_genome | 'GRCh37' | Genome assembly version used for annotation. |
| ensemblvep_cache_version | 110 | Version of the Ensembl VEP cache used during annotation. Must match the installed VEP version. |
| ensemblvep_cache | NA | Optional. Path to a pre-existing VEP cache. If no cache exists, it will be downloaded automatically. |
| magma_reference_plink | NA | Path to the PLINK reference panel used by MAGMA for gene-based analyses. |
Note: Environment files are provided for VEP versions 110, 113, 114, and 115. The VEP version installed in Conda must correspond to the VEP cache version. If a different VEP version is required (e.g., to match an existing VEP cache), an additional environment file named vep_X.yml should be created, where X denotes the desired VEP version.
Examples: We provide two preconfigured study-specific configuration files, configs/olfaction.config and configs/panukbb.config, for the two GNExT instances accessible at http://olfaction.gnext.gm.eurac.edu/ and http://panukbb.gnext.gm.eurac.edu/. For more information on these instances and the data behind the platforms, please refer to our publication.
| Parameter | Example / Default Value | Description |
|---|---|---|
| slurm_queue | slow-mc2 | Defines the SLURM queue to be used when profile = slurm. |
| global_maxForks | 10 | Defines the global maximum number of processes that can run in parallel across the workflow. |
| normalize_cpus | 12 | Number of CPU cores allocated for the normalization step. |
| normalize_memory | '64GB' | Memory allocated for the normalization step. |
| vcf_cpus | 16 | Number of CPU cores allocated for variant annotation and processing (e.g., VEP execution). |
| vcf_memory | '64GB' | Memory allocated for VCF processing and annotation. |
| manhattan_qq_cpus | 12 | Number of CPU cores used for generating Manhattan and QQ plots. |
| manhattan_qq_memory | '64GB' | Memory allocated for the Manhattan and QQ plot generation step. |
| chrom_bgz_maxForks | 23 | Maximum number of chromosome compression tasks (bgzip) that can run simultaneously. |
| chrom_bgz_memory | '64GB' | Memory allocated for per-chromosome compression tasks. |
| chrom_bgz_cpus | 8 | Number of CPU cores allocated for per-chromosome compression (bgzip) tasks. |
| magma_cpus | 32 | Number of CPU cores allocated for MAGMA gene-based analyses. |
| magma_memory | '64GB' | Memory allocated for MAGMA analyses. |
| magma_input_cpus | 32 | Number of CPU cores allocated for preparing MAGMA input files. |
| magma_input_memory | '64GB' | Memory allocated for MAGMA input file preparation. |
These parameters are typically not intended to be modified.
| Parameter | Example / Default Value | Description |
|---|---|---|
| manhattan_num_unbinned | 500 | Number of unbinned variants displayed in the Manhattan plot to preserve the most significant points without binning. |
| manhattan_peak_max_count | 500 | Maximum number of peaks (significant loci) displayed in the Manhattan plot for readability and performance. |
| manhattan_peak_pval_threshold | 1e-6 | P-value significance threshold used for identifying peaks in the Manhattan plot. Variants below this value are considered significant. |
| manhattan_peak_sprawl_dist | 200_000 | Minimum genomic distance (in base pairs) required to distinguish separate peaks in the Manhattan plot. Peaks closer than this are merged. |
| top_hits_pval_cutoff | 1e-6 | P-value threshold used to select top associated variants for downstream analysis. |
| top_hits_max_limit | 10,000 | Maximum number of top associated variants reported or exported after filtering by p-value. |
| ensemblvep_distance_up | 5,000 | Upstream distance (in base pairs) used by Ensembl VEP when mapping variants to nearby genes. |
| ensemblvep_distance_down | 5,000 | Downstream distance (in base pairs) used by Ensembl VEP when mapping variants to nearby genes. |
Once all parameters have been correctly specified, the pipeline can be executed.
We support execution through Conda, Docker, or Singularity environments (e.g., -profile conda, -profile docker) and enable job scheduling via SLURM (e.g., -profile slurm,conda, -profile slurm,docker).
```shell
nextflow run main.nf -profile slurm,docker
```

To resume the workflow:

```shell
nextflow run main.nf -profile slurm,docker -resume
```

The Nextflow pipeline generates symbolic links in the output directory. Therefore, as a final step, navigate to the output directory and execute the following command:

```shell
find . -type l -exec sh -c '
  target=$(readlink -f "$1")
  if [ -f "$target" ]; then
    echo "Replacing symlink with real file: $1"
    cp --remove-destination "$target" "$1"
  else
    echo "Skipping broken symlink: $1 -> $target"
  fi
' _ {} \;
```

To extend an existing run with additional traits, the previous execution must have completed successfully, and the corresponding work directory containing the intermediate results must still be available.
Change only the pheno_file, appending the new phenotypes as rows at the end (not between existing rows), and then rerun the workflow with the command below.
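Before rerunning, one way to confirm that the phenotype file was only extended at the end is to compare the previous version against the top of the new one. This is a sketch under the assumption that a copy of the previous phenotype file was kept; the helper `is_append_only` and the demo files are hypothetical.

```shell
# Hypothetical helper: succeed only if the old phenotype file is an exact
# prefix of the new one, i.e. rows were appended and nothing else changed.
is_append_only() {
  old_lines=$(wc -l < "$1" | tr -d ' ')
  head -n "$old_lines" "$2" | cmp -s - "$1"
}

# Demonstration with made-up files standing in for the old and new pheno_file.
old=$(mktemp); new=$(mktemp)
printf 'phenocode\tfilename\ntrait_a\t/data/trait_a.tsv\n' > "$old"
cp "$old" "$new"
printf 'trait_b\t/data/trait_b.tsv\n' >> "$new"

if is_append_only "$old" "$new"; then
  echo "Only new rows were appended."
else
  echo "Existing rows were changed or reordered."
fi
```

Once the check passes, the workflow can be rerun with -resume.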
```shell
nextflow run main.nf -profile slurm,docker -resume
```

Bridging the gap between genome-wide association studies and network medicine with GNExT. Lis Arend, Fabian Woller, Bastienne Rehor, David Emmert, Johannes Frasnelli, Christian Fuchsberger, David B. Blumenthal, Markus List. bioRxiv 2026.01.30.702559; doi: https://doi.org/10.64898/2026.01.30.702559
