11 Feb 2026: prescored files have been moved to a new temporary location and are now available for download.
NCBoost is a pathogenicity score of non-coding variants to be used in the study of Mendelian diseases. It is based on supervised learning on a comprehensive set of ancient, recent and ongoing purifying selection signals in humans. NCBoost was trained on a collection of 2336 high-confidence pathogenic non-coding variants associated with monogenic Mendelian diseases. NCBoost performs consistently across diverse independent testing data sets and outperforms other existing reference methods. Further information can be found in the NCBoost 2 paper.
Of note, the NCBoost software can score any type of genomic position, provided that the required purifying selection features used by the model are available. However, it is important to realize that, among the set of high-confidence pathogenic non-coding variants that were used to train NCBoost, more than 98% were found at proximal cis-regulatory regions, with only 27 variants overlapping more distal intergenic regions. Thus, for consistency with the training process, the most reliable genomic contexts for the use of the NCBoost score are the proximal cis-regulatory regions of protein-coding genes.
We precomputed the NCBoost score for all variants at 1.88 billion non-coding genomic positions overlapping intergenic, intronic, 5'UTR, 3'UTR, upstream and downstream regions (that is, closer than 1 kb from the Transcription Start Site, TSS, and the Transcription End Site, TES, respectively) associated with a background set of 19433 protein-coding genes for which we could retrieve annotation features. Variant mapping and annotation of non-coding genomic positions were done with the ANNOVAR software [1], using the gene-based annotation option based on RefSeq (assembly version hg38). For positions overlapping several types of regions associated with different genes and transcripts (either coding or non-coding), a number of criteria were adopted, as described in the NCBoost 2 paper.
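The upstream/downstream definition above can be illustrated with a minimal sketch. It is not ANNOVAR's implementation: the function name, the argument names and the handling of the exact 1 kb boundary are assumptions made for illustration, and it considers a single transcript only.

```python
def flank_region(pos, tx_start, tx_end, strand, window=1000):
    """Label a 1-based position relative to one transcript.

    Returns "upstream", "downstream" or None. Assumption: "closer than
    1 kb" is read as a distance strictly between 0 and `window` bases.
    """
    if strand not in ("+", "-"):
        raise ValueError("strand must be '+' or '-'")
    # On the minus strand the transcript is read right to left, so the
    # TSS is the larger coordinate and the TES the smaller one.
    tss, tes = (tx_start, tx_end) if strand == "+" else (tx_end, tx_start)
    direction = 1 if strand == "+" else -1
    dist_before_tss = (tss - pos) * direction  # > 0 when pos lies before the TSS
    dist_after_tes = (pos - tes) * direction   # > 0 when pos lies after the TES
    if 0 < dist_before_tss < window:
        return "upstream"
    if 0 < dist_after_tes < window:
        return "downstream"
    return None
```

For a hypothetical plus-strand transcript spanning 10000 to 20000, position 9500 would be labelled upstream and position 20500 downstream; on the minus strand the two labels swap.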
The precomputed hg38 NCBoost 2 scores in proximal cis-regulatory regions of protein-coding genes can be downloaded as:
- tabix-indexed file (.tsv.gz, 100 GB) and its index file (.tsv.gz.tbi), using wget:
wget https://nginx.sogam.org/files/ncboost_v2_hg38_20260202_light.tsv.gz
wget https://nginx.sogam.org/files/ncboost_v2_hg38_20260202_light.tsv.gz.tbi
- parquet file (.parquet, 101 GB):
wget https://nginx.sogam.org/files/ncboost_v2_hg38_20260202_light.parquet
These files contain the following columns:
chr, chromosome name, as [1:22,X,Y]
pos, 1-based genomic position (GRCh38 genome assembly)
region, type of non-coding region overlapping the position, as provided by ANNOVAR (see above)
closest_gene_name, name of the associated protein-coding gene
NCBoost_Score, NCBoost 2 score. NCBoost score ranges from 0 to 1. The higher the score, the higher the pathogenicity potential of the position.
NCBoost_chr_rank_perc, chromosome-wise rank percentile (ranging from 0 to 1) of the corresponding NCBoost 2 score. The higher the rank percentile, the higher the pathogenic potential of the position.
We advise users to rely on NCBoost_chr_rank_perc rather than NCBoost_Score, as it is more comparable across chromosomes.
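As an illustration of how the rank percentile column might be used, the sketch below streams the prescored TSV and keeps positions at or above a chosen threshold. It assumes the file starts with a header line carrying exactly the column names listed above; the function names and the default threshold are hypothetical and not part of NCBoost.

```python
import csv
import gzip


def high_rank_positions(lines, min_rank_perc=0.99):
    """Yield (chr, pos, gene, rank_perc) for rows at or above the threshold.

    `lines` is any iterable of tab-delimited text lines whose first line
    is a header with the column names described above (an assumption).
    """
    reader = csv.DictReader(lines, delimiter="\t")
    for row in reader:
        rank = float(row["NCBoost_chr_rank_perc"])
        if rank >= min_rank_perc:
            yield row["chr"], int(row["pos"]), row["closest_gene_name"], rank


def scan_prescored_file(path, min_rank_perc=0.99):
    """Stream a gzip-compressed prescored file without loading it in memory."""
    with gzip.open(path, "rt", newline="") as handle:
        yield from high_rank_positions(handle, min_rank_perc)
```

Because the file is read line by line, memory use stays small, but a full scan of a 100 GB file is slow; for lookups by position, the tabix index is the better route.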
The NCBoost gene database, integrating several database identifiers (Ensembl, HGNC, NCBI), OMIM disease-gene status and the gene-level conservation features used in this work for 19433 protein-coding genes, is available here.
The NCBoost software is also provided in this repository in case you are interested in training the NCBoost framework on your own variants or your own features, or assessing the NCBoost scores for genomic positions other than those included in the precomputed files.
The following sections will guide you through the steps needed for the annotation of variants, training and execution of NCBoost-2 pretrained models to obtain the pathogenicity score.
If you only want to add NCBoost score to your variants, follow steps 1 to 5. If you want to retrain NCBoost or add all NCBoost features to your variants, follow steps 1 to 2 and 6 to 8.
NCBoost models are stored as Git Large File Storage (LFS) objects and require git-lfs to be installed. On Debian- or Ubuntu-based Linux distributions, git-lfs can be installed by running:
sudo apt install git-lfs
NCBoost scripts and associated data may then be cloned from the NCBoost github repository:
git clone https://github.com/RausellLab/NCBoost-2.git
cd NCBoost-2
Check that the models were downloaded properly by running:
git lfs ls-files
which should output the 10 toy models and 10 ncboost models.
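The same check can be scripted. The sketch below assumes the usual `git lfs ls-files` layout of one `<oid> <marker> <path>` entry per line, where `*` marks a file whose content is present and `-` a file that is still only an LFS pointer; the helper names are hypothetical.

```python
import subprocess


def parse_lfs_listing(text):
    """Split `git lfs ls-files` output into downloaded and pointer-only paths."""
    downloaded, pointers = [], []
    for line in text.splitlines():
        parts = line.split(maxsplit=2)
        if len(parts) != 3:
            continue  # skip blank or unexpected lines
        _oid, marker, path = parts
        (downloaded if marker == "*" else pointers).append(path)
    return downloaded, pointers


def models_are_downloaded(expected=20):
    """Run git-lfs in the current repository and check the model count.

    20 = 10 toy models + 10 NCBoost models, as stated above.
    """
    result = subprocess.run(
        ["git", "lfs", "ls-files"], capture_output=True, text=True, check=True
    )
    downloaded, pointers = parse_lfs_listing(result.stdout)
    return len(downloaded) == expected and not pointers
```

If `models_are_downloaded()` returns False from the repository root, running `git lfs pull` is the usual way to fetch the missing objects.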
The required python libraries are detailed in libraries.sh and can be installed using conda & pip as follows:
conda create --name ncboost2 python=3.10.14
conda activate ncboost2
bash libraries.sh
Alternatively, a .yml file containing all conda & pip libraries is also available and can be installed as follows:
conda env create --name ncboost2 --file=ncboost2.yml
Download the tabix-indexed file and its index as shown below, and move them to data/prescored_WG:
wget https://nginx.sogam.org/files/ncboost_v2_hg38_20260202_light.tsv.gz
wget https://nginx.sogam.org/files/ncboost_v2_hg38_20260202_light.tsv.gz.tbi
mv ncboost_v2_hg38_20260202_light.tsv.* data/prescored_WG/
Input SNVs can be fed to NCBoost as either .tsv or .vcf files:
Variants have to be reported in 1-based, GRCh38 genomic coordinates. The required variant file is a tab-delimited text file with column headers following the format:
chr pos ref alt
1 12589 G A
The chr column should not contain the 'chr' prefix. Any other columns can be added in addition to the required four columns.
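A quick pre-flight check of a .tsv input against these requirements could look like the sketch below. It is not part of NCBoost: the function name is hypothetical, and it assumes alleles are given as single upper-case nucleotides (SNVs).

```python
import csv

REQUIRED = ["chr", "pos", "ref", "alt"]
BASES = {"A", "C", "G", "T"}


def validate_variant_tsv(lines):
    """Return a list of problems found; an empty list means the input looks fine."""
    reader = csv.DictReader(lines, delimiter="\t")
    header = reader.fieldnames or []
    missing = [c for c in REQUIRED if c not in header]
    if missing:
        return ["missing required column(s): " + ", ".join(missing)]
    problems = []
    for n, row in enumerate(reader, start=2):  # line 1 is the header
        chrom, pos, ref, alt = ((row[c] or "").strip() for c in REQUIRED)
        if chrom.lower().startswith("chr"):
            problems.append(f"line {n}: remove the 'chr' prefix from {chrom!r}")
        if not pos.isdigit() or int(pos) < 1:
            problems.append(f"line {n}: pos must be a 1-based integer, got {pos!r}")
        if ref not in BASES or alt not in BASES:
            problems.append(f"line {n}: ref and alt must be single nucleotides")
    return problems
```

It can be applied to an open file handle, for instance `validate_variant_tsv(open("my_variants.tsv", newline=""))`, where the file name is a placeholder. Extra columns are ignored, matching the format description above.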
For .vcf input, variants have to be reported in 1-based, GRCh38 genomic coordinates. The CHROM column should not contain the 'chr' prefix.
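If a VCF uses 'chr'-prefixed contig names, the prefix can be removed before scoring. The following is a minimal sketch with a hypothetical helper name; it only rewrites `##contig` header lines and the CHROM column, and leaves every other line untouched.

```python
import re

# Matches the start of a contig header line such as "##contig=<ID=chr1,...>".
_CONTIG_ID = re.compile(r"^(##contig=<ID=)chr")


def strip_chr_prefix(vcf_lines):
    """Yield VCF lines with the 'chr' prefix removed from contig names."""
    for line in vcf_lines:
        if line.startswith("##"):
            # Keep contig declarations consistent with the renamed records.
            yield _CONTIG_ID.sub(r"\1", line)
        elif line.startswith("#"):
            yield line  # the "#CHROM POS ID ..." column header
        elif line.startswith("chr"):
            yield line[3:]  # CHROM is the first field of a data line
        else:
            yield line
```

A typical use would read an uncompressed VCF line by line and write the result to a new file, for example `out.writelines(strip_chr_prefix(src))`.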
Make sure that you are running the scripts from the root of this folder (NCBoost-2/).
NCBoost can score files following the format specified just above.
Don't forget to first activate the ncboost2 environment before running the script:
conda activate ncboost2
Then run:
python src/ncboost_annotate_tsv.py /path/to/tsv/file /data/ncboost_v2_prescored
example:
python src/ncboost_annotate_tsv.py data/testing/testing_data.tsv data/ncboost_v2_prescored
The per-chromosome rank percentile of the NCBoost score will be added as an extra column to the file.
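NCBoost can also score VCF files in which every row describes a single bi-allelic variant. Multi-allelic loci reported on a single row should first be split into one row per alternative allele using bcftools (for example with bcftools norm -m -any).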
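To find out whether a VCF still contains multi-allelic rows, one option is to look for a comma in the ALT field, which is how the VCF format lists several alternative alleles. The sketch below uses a hypothetical function name and reads plain-text VCF lines.

```python
def multiallelic_records(vcf_lines):
    """Yield (chrom, pos, alt) for data lines whose ALT lists several alleles."""
    for line in vcf_lines:
        if line.startswith("#") or not line.strip():
            continue  # skip header lines and blank lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 5:
            continue  # not a well-formed VCF record
        chrom, pos, alt = fields[0], int(fields[1]), fields[4]
        if "," in alt:
            yield chrom, pos, alt
```

An empty result suggests the file can be passed to the scoring script as is; any reported record would need to be split beforehand.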
Don't forget to first activate the ncboost2 environment before running the script:
conda activate ncboost2
Then run:
python src/ncboost_annotate_vcf.py /path/to/vcf/file /data/ncboost_v2_prescored
example:
python src/ncboost_annotate_vcf.py data/testing/testing_data.vcf data/ncboost_v2_prescored
The per-chromosome rank percentile of the NCBoost score will be added at the end of the INFO field of each variant.
NCBoost v2 features for all possible SNVs at 1,879,856,949 positions are available here (total size = 260 GB) as per-chromosome compressed tabix-indexed files.
wget https://nginx.sogam.org/files/ncboost_v2_hg38_20260202_full.tar.gz
Unpack the tar file and move the data to the data/ folder:
tar -zxvf ncboost_v2_hg38_20260202_full.tar.gz
mv ncboost_v2_hg38_20260202_full data/ncboost_v2_prescored/
Complete details about each feature are available in the NCBoost 2 paper.
The NCBoost framework can be trained using the ncboost_train.ipynb notebook. It loads and annotates a set of pathogenic variants and the corresponding set of region-matched random common variants, trains the 10 models and produces the corresponding feature importance plot, as well as ROC and PR curves. The trained models are saved and can be used for later scoring.
Annotation requires downloading the full set of features used by NCBoost (260 GB). For convenience, we also provide the set of pathogenic and common variants already annotated with NCBoost features, so that re-training does not require downloading the feature file.
Don't forget to first select the ncboost2 environment before running the script in jupyter:
conda activate ncboost2
ncboost_train.ipynb should be run through a jupyter notebook environment, while the ncboost_train.py script should be run as follows:
python src/ncboost_train.py
The NCBoost framework can be applied to annotate and score any variant using the jupyter notebook ncboost_test.ipynb or its python equivalent, ncboost_test.py. It applies the trained framework used to generate the results in the NCBoost 2 paper. Annotation requires downloading the full set of features used by NCBoost (260 GB). For convenience, we also provide a set of pathogenic and common variants already annotated with NCBoost features, so that scoring these variants does not require downloading the feature file.
Don't forget to first activate the environment:
conda activate ncboost2
ncboost_test.ipynb should be run through a jupyter notebook environment, while the ncboost_test.py script should be run as follows:
python src/ncboost_test.py path/to/input/file.tsv
Example
python src/ncboost_test.py data/testing/testing_data.tsv
The output file will be created in the same folder as the input file, as a tab-delimited text file with the following columns:
- the chromosome, position, reference and alternative allele of the variant
- the name and Ensembl Gene ID of the nearest gene to which the variant was associated, and the corresponding non-coding region (upstream, downstream, UTR5, UTR3, intronic or intergenic)
- the gene type and 11 gene-based features (slr_dnds, gene_age, pLI, zscore_mis, zscore_syn, loeuf, GDI, ncRVIS, ncGERP, RVIS_percentile, pcGERP), using a reference of 19433 protein-coding genes
- 6 one-hot encoded non-coding region types
- 18 features extracted from CADD annotation files [2] (GC, CpG, pri/mam/verPhCons, pri/mam/verPhyloP, GerpN, GerpS, GerpRS, GerpRSpval, ZooPriPhyloP, ZooVerPhyloP, bStatistic, ZooRoCC, ZooUCE, Roulette-AR)
- the 9 MAF features from gnomAD [3] (mean_MAF, mean_MAF_AFR/AMI/AMR/ASJ/EAS/FIN/MID/NFE/SAS)
- the CDTS score
- the max SpliceAI score at the variant and at the position level [4]
- the substitution-encoding features
- the NCBoost score
- the extra columns provided by the user in the input file
The NCBoost score ranges from 0 to 1 (the higher the value, the higher the predicted pathogenicity).
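For readers unfamiliar with one-hot encoding, the sketch below shows how a region label could be turned into six indicator values, one per non-coding region type. The function name and the dictionary keys are illustrative assumptions; the actual column names in the output file may differ.

```python
# The six non-coding region types named in the output description above.
REGION_TYPES = ["upstream", "downstream", "UTR5", "UTR3", "intronic", "intergenic"]


def one_hot_region(region):
    """Return a dict with 1 for the given region type and 0 for the others."""
    if region not in REGION_TYPES:
        raise ValueError(f"unknown region type: {region!r}")
    return {name: int(name == region) for name in REGION_TYPES}
```

Exactly one of the six values is 1 for any given variant, which lets a tree-based model treat the region type as a set of binary features.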
More information can be found in the NCBoost 2 paper.
1: Wang and Hakonarson (2010). ANNOVAR: functional annotation of genetic variants from high-throughput sequencing data. Nucleic Acids Res.
2: Schubach et al. (2024). CADD v1.7: using protein language models, regulatory CNNs and other nucleotide-level scores to improve genome-wide variant predictions. Nucleic Acids Res.
3: Chen et al. (2024). A genomic mutational constraint map using variation in 76,156 human genomes. Nature.
4: Jaganathan et al. (2019). Predicting Splicing from Primary Sequence with Deep Learning. Cell.
Please address comments and questions about NCBoost to barthelemy.caron@institutimagine.org and antonio.rausell@institutimagine.org
NCBoost 2 scripts, framework and databases are available under the Apache License 2.0.
Copyright 2025 Clinical BioInformatics Laboratory - Institut Imagine
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.