NCBoost 2

11 Feb 2026 : prescored files have been moved to a new temporary location and are now available for download.

NCBoost is a pathogenicity score for non-coding variants, to be used in the study of Mendelian diseases. It is based on supervised learning on a comprehensive set of ancient, recent and ongoing purifying selection signals in humans. NCBoost was trained on a collection of 2336 high-confidence pathogenic non-coding variants associated with monogenic Mendelian diseases. NCBoost performs consistently across diverse independent testing data sets and outperforms other existing reference methods. Further information can be found in the NCBoost 2 paper.

Of note, the NCBoost software can score any type of genomic position, provided that the required purifying selection features used by the model are available. However, it is important to realize that, among the set of high-confidence pathogenic non-coding variants that were used to train NCBoost, more than 98% were found at proximal cis-regulatory regions, with only 27 variants overlapping more distal intergenic regions. Thus, for consistency with the training process, the most reliable genomic contexts for the use of the NCBoost score are the proximal cis-regulatory regions of protein-coding genes.

Precomputed NCBoost scores in proximal cis-regulatory regions of protein-coding genes

We precomputed the NCBoost score for all variants at 1.88 billion non-coding genomic positions overlapping intergenic, intronic, 5'UTR, 3'UTR, upstream and downstream regions (i.e. closer than 1 kb from the Transcription Start Site (TSS) and the Transcription End Site (TES), respectively) associated with a background set of 19433 protein-coding genes for which we could retrieve annotation features. Variant mapping and annotation of non-coding genomic positions were done with the ANNOVAR software [1], using the gene-based annotation option based on RefSeq (assembly version hg38). For positions overlapping several types of regions associated with different genes and transcripts (either coding or non-coding), a number of criteria were adopted, as described in the NCBoost 2 paper.

The precomputed hg38 NCBoost 2 scores in proximal cis-regulatory regions of protein-coding genes can be downloaded as:

tabix-indexed file (.tsv.gz, 100 GB) and index file (.tsv.gz.tbi), using wget:

wget https://nginx.sogam.org/files/ncboost_v2_hg38_20260202_light.tsv.gz
wget https://nginx.sogam.org/files/ncboost_v2_hg38_20260202_light.tsv.gz.tbi

parquet file (.parquet, 101 GB):

wget https://nginx.sogam.org/files/ncboost_v2_hg38_20260202_light.parquet
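Once the tabix-indexed file and its index are downloaded, scores for a genomic interval can be retrieved without reading the whole file. The following is a minimal Python sketch, assuming the tabix command-line tool (distributed with htslib) is installed and on the PATH. The helper names and the example interval are illustrative and are not part of NCBoost.

```python
import subprocess


def tabix_command(path, chrom, start, end):
    """Build a tabix region query (chromosome names carry no 'chr' prefix)."""
    if start < 1 or end < start:
        raise ValueError("expected 1-based coordinates with start <= end")
    return ["tabix", path, f"{chrom}:{start}-{end}"]


def query_region(path, chrom, start, end):
    """Run tabix and return the matching rows as lists of column values."""
    result = subprocess.run(
        tabix_command(path, chrom, start, end),
        check=True,            # raise if tabix exits with an error
        capture_output=True,
        text=True,
    )
    return [line.split("\t") for line in result.stdout.splitlines()]


# Usage (hypothetical interval, requires the downloaded files):
# rows = query_region("ncboost_v2_hg38_20260202_light.tsv.gz", "1", 12000, 13000)
```

The .tbi index must sit next to the .gz file for the query to work.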

These files contain the following columns:
chr, chromosome name, as [1:22,X,Y]
pos, 1-based genomic position (GRCh38 genome assembly)
region, type of non-coding region overlapping the position, as provided by ANNOVAR (see above)
closest_gene_name, name of the associated protein-coding gene
NCBoost_Score, NCBoost 2 score. NCBoost score ranges from 0 to 1. The higher the score, the higher the pathogenicity potential of the position.
NCBoost_chr_rank_perc, chromosome-wise rank percentile (ranging from 0 to 1) of the corresponding NCBoost 2 score. The higher the rank percentile, the higher the pathogenic potential of the position.

We advise users to rely on NCBoost_chr_rank_perc rather than NCBoost_Score, as it is more comparable across chromosomes.
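As an illustration of working with the rank percentile, the sketch below keeps only the rows at or above a chosen percentile. It assumes the columns appear in the order listed above; the 0.99 threshold is an arbitrary example, and whether the file starts with a header line is not stated here, so a header is skipped if present. The rows in the example are synthetic and are not real NCBoost scores.

```python
import csv
import io

# Column order assumed to match the list above.
COLUMNS = ["chr", "pos", "region", "closest_gene_name",
           "NCBoost_Score", "NCBoost_chr_rank_perc"]


def top_ranked(lines, threshold=0.99):
    """Yield rows whose chromosome-wise rank percentile is >= threshold."""
    reader = csv.DictReader(lines, fieldnames=COLUMNS, delimiter="\t")
    for row in reader:
        if row["chr"].lstrip("#") == "chr":  # skip a header line if present
            continue
        if float(row["NCBoost_chr_rank_perc"]) >= threshold:
            yield row


# Synthetic rows, for illustration only.
example = io.StringIO(
    "1\t12589\tupstream\tGENE_A\t0.12\t0.995\n"
    "1\t12590\tupstream\tGENE_A\t0.03\t0.410\n"
)
hits = list(top_ranked(example))
```

The same function can be fed the lines of the downloaded file, for example through gzip.open(path, "rt").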

NCBoost gene database

The NCBoost gene database, which integrates several database identifiers (Ensembl, HGNC, NCBI), OMIM disease-gene status and the gene-level conservation features used in this work for more than 19433 protein-coding genes, is available here.

NCBoost software

The NCBoost software is also provided in this repository in case you are interested in training the NCBoost framework on your own variants or your own features, or assessing the NCBoost scores for genomic positions other than those included in the precomputed files.

The following sections will guide you through the steps needed to annotate variants, train the models and run the pretrained NCBoost 2 models to obtain the pathogenicity score.

Downloads, installation and processing of input files

If you only want to add NCBoost score to your variants, follow steps 1 to 5. If you want to retrain NCBoost or add all NCBoost features to your variants, follow steps 1 to 2 and 6 to 8.

1. Download NCBoost 2 software

NCBoost models are stored as Git Large File Storage (LFS) objects and require git-lfs to be installed. On Linux distributions that use apt, git-lfs can be installed by running:

sudo apt install git-lfs

NCBoost scripts and associated data may then be cloned from the NCBoost github repository:

git clone https://github.com/RausellLab/NCBoost-2.git
cd NCBoost-2

Check that the models were downloaded properly by running:

git lfs ls-files

which should output the 10 toy models and 10 ncboost models.
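If the clone was made before git-lfs was installed, the model files may be present only as small pointer files. In the output of git lfs ls-files, an asterisk after the object ID marks a fully downloaded file and a minus sign marks a pointer. The sketch below parses that output and lists the pointer-only paths. The sample output, including object IDs and file names, is made up for illustration.

```python
def lfs_pointer_only(ls_files_output):
    """Return paths that `git lfs ls-files` marks with '-' (pointer only)."""
    missing = []
    for line in ls_files_output.splitlines():
        # Each line looks like: <short-oid> <*|-> <path>
        parts = line.split(maxsplit=2)
        if len(parts) == 3 and parts[1] == "-":
            missing.append(parts[2])
    return missing


# Hypothetical output; the object IDs and file names are invented.
sample = (
    "4d7a214614 * models/example_model_0.pkl\n"
    "9f2c1b77aa - models/example_model_1.pkl\n"
)
not_downloaded = lfs_pointer_only(sample)
```

If any path is reported, running git lfs pull from the repository root should fetch the missing content.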

2. Install ncboost2 environment

The required python libraries are detailed in libraries.sh and can be installed using conda & pip as follows:

conda create --name ncboost2 python=3.10.14
conda activate ncboost2
bash libraries.sh

Alternatively, a .yml file containing all conda & pip libraries is also available and can be installed as follows:

conda env create --name ncboost2 --file=ncboost2.yml

3. Download NCBoost prescored file

Download the tabix-indexed score file and its index file as described above, and move them to data/prescored_WG:

wget https://nginx.sogam.org/files/ncboost_v2_hg38_20260202_light.tsv.gz
wget https://nginx.sogam.org/files/ncboost_v2_hg38_20260202_light.tsv.gz.tbi
mv ncboost_v2_hg38_20260202_light.tsv.* data/prescored_WG/

4. Variant input format

Input SNVs can be fed to NCBoost as either .tsv or .vcf files:

tsv file

Variants have to be reported in 1-based, GRCh38 genomic coordinates. The required variant file is a tab-delimited text file with column headers, following the format:

chr	pos	ref	alt
1	12589	G	A

The chr column should not contain the 'chr' prefix. Any other columns can be added in addition to the required four columns.
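A quick check of an input file against these requirements can catch formatting problems before scoring. The sketch below is not part of NCBoost. It assumes that alleles are written as single uppercase nucleotides (the input is described as SNVs) and reports every problem it finds.

```python
import csv
import io

REQUIRED = ["chr", "pos", "ref", "alt"]
BASES = {"A", "C", "G", "T"}  # assumption: uppercase single nucleotides


def check_variants(handle):
    """Return a list of problems found in a tab-delimited variant file."""
    reader = csv.DictReader(handle, delimiter="\t")
    missing = [c for c in REQUIRED if c not in (reader.fieldnames or [])]
    if missing:
        return [f"missing column(s): {', '.join(missing)}"]
    problems = []
    for n, row in enumerate(reader, start=2):  # line 1 is the header
        chrom, pos, ref, alt = ((row[c] or "") for c in REQUIRED)
        if chrom.lower().startswith("chr"):
            problems.append(f"line {n}: 'chr' prefix in {chrom}")
        if not pos.isdigit() or int(pos) < 1:
            problems.append(f"line {n}: pos must be a 1-based integer")
        if ref not in BASES or alt not in BASES:
            problems.append(f"line {n}: ref/alt must be single nucleotides")
    return problems


good = io.StringIO("chr\tpos\tref\talt\n1\t12589\tG\tA\n")
bad = io.StringIO("chr\tpos\tref\talt\nchr1\t0\tG\tAT\n")
good_report = check_variants(good)
bad_report = check_variants(bad)
```

Extra columns are ignored by the check, in line with the format described above.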

vcf file

Variants have to be reported in 1-based, GRCh38 genomic coordinates. The CHROM column should not contain the 'chr' prefix.
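If a VCF file uses 'chr'-prefixed chromosome names, the prefix has to be removed first. The sketch below shows one simple way to do this on uncompressed VCF lines. It is an illustration rather than an NCBoost utility, and it leaves header lines (including any ##contig entries) untouched, which is an assumption about what the scoring script tolerates.

```python
def strip_chr_prefix(vcf_lines):
    """Yield VCF lines with a leading 'chr' removed from the CHROM column."""
    for line in vcf_lines:
        if line.startswith("#"):
            yield line  # header lines are passed through unchanged
            continue
        chrom, sep, rest = line.partition("\t")
        if chrom.startswith("chr"):
            chrom = chrom[3:]
        yield chrom + sep + rest


# Synthetic example lines.
vcf = [
    "##fileformat=VCFv4.2\n",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n",
    "chr1\t12589\t.\tG\tA\t.\t.\t.\n",
]
cleaned = list(strip_chr_prefix(vcf))
```

For compressed or indexed files, renaming chromosomes with bcftools annotate --rename-chrs is likely the more robust route.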

5. Scoring variants with the prescored NCBoost score

Make sure that you are running the scripts from the root of this folder (NCBoost-2/).

tsv file scoring

NCBoost can score files following the format specified just above.

Don't forget to activate the ncboost2 environment before running the script:

conda activate ncboost2

Then run:

python src/ncboost_annotate_tsv.py /path/to/tsv/file data/ncboost_v2_prescored

example:

python src/ncboost_annotate_tsv.py data/testing/testing_data.tsv data/ncboost_v2_prescored

The per-chromosome rank percentile of the NCBoost score will be added to the file as an extra column.

vcf file scoring

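NCBoost can also score VCF files in which each row describes a single bi-allelic variant. VCF files that list multi-allelic loci on a single row should first be split into bi-allelic rows using bcftools (for example with bcftools norm -m -any).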
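To see whether a VCF file needs this preprocessing, the rows that list more than one ALT allele can be flagged as below. This is a small illustrative check, not a replacement for bcftools: it only reports such rows and does not split them.

```python
def multiallelic_records(vcf_lines):
    """Return (CHROM, POS, ALT) for records listing more than one ALT allele."""
    found = []
    for line in vcf_lines:
        if line.startswith("#") or not line.strip():
            continue  # skip header and empty lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 5:
            continue  # not a complete VCF record
        chrom, pos, alt = fields[0], fields[1], fields[4]
        if "," in alt:  # comma-separated ALT means a multi-allelic row
            found.append((chrom, pos, alt))
    return found


# Synthetic example lines.
vcf = [
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n",
    "1\t12589\t.\tG\tA\t.\t.\t.\n",
    "1\t12600\t.\tC\tT,G\t.\t.\t.\n",
]
to_split = multiallelic_records(vcf)
```

An empty result suggests the file already has one bi-allelic variant per row.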

Don't forget to activate the ncboost2 environment before running the script:

conda activate ncboost2

Then run:

python src/ncboost_annotate_vcf.py /path/to/vcf/file data/ncboost_v2_prescored

example:

python src/ncboost_annotate_vcf.py data/testing/testing_data.vcf data/ncboost_v2_prescored

The per-chromosome rank percentile of the NCBoost score will be appended to the INFO field of each variant.

6. Download the feature file

NCBoost v2 features for all possible SNVs at 1,879,856,949 positions are available here (total size = 260 GB) as per-chromosome compressed tabix-indexed files.

wget https://nginx.sogam.org/files/ncboost_v2_hg38_20260202_full.tar.gz

Unpack the tar file and move the data to the data/ folder:

tar -zxvf ncboost_v2_hg38_20260202_full.tar.gz
mv ncboost_v2_hg38_20260202_full data/ncboost_v2_prescored/

Complete details about each feature are available in the NCBoost 2 paper.
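Given the size of the archive, it may be worth checking the free disk space before starting the download in this step. The sketch below uses Python's standard library. The unpacked size is not stated here, so the factor of two is only a guess to leave room for extraction, and GB is taken as 10^9 bytes.

```python
import shutil


def has_free_space(path, required_gb):
    """Return True if the filesystem holding `path` has enough free space."""
    free_bytes = shutil.disk_usage(path).free
    return free_bytes >= required_gb * 10**9


# 260 GB for the archive; doubling it is an assumption, not a documented
# requirement, to account for the unpacked files.
enough = has_free_space(".", 2 * 260)
```

Replace "." with the directory that will hold the data/ folder if it lives on another filesystem.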

7. NCBoost training & SNV annotation with NCBoost features

The NCBoost framework can be trained using the ncboost_train.ipynb notebook. It loads and annotates a set of pathogenic variants and the corresponding set of region-matched random common variants, trains the 10 models and produces the corresponding feature importance plot, as well as ROC and PR curves. The trained models are saved and can be used for later scoring.

The annotation requires downloading the full set of features used by NCBoost (260 GB). For convenience, we also provide the set of pathogenic and common variants already annotated with NCBoost features, so that re-training does not require downloading the feature file.

Don't forget to first select the ncboost2 environment before running the script in jupyter:

conda activate ncboost2

ncboost_train.ipynb should be run through a jupyter notebook environment, while the ncboost_train.py script should be run as follows:

python src/ncboost_train.py

8. NCBoost model inference

The NCBoost framework can be applied to annotate and score any variant using the Jupyter notebook ncboost_test.ipynb or its Python equivalent, ncboost_test.py. It applies the trained framework used to generate the results in the NCBoost 2 paper. The annotation requires downloading the full set of features used by NCBoost (260 GB). For convenience, we also provide a set of pathogenic and common variants already annotated with NCBoost features, so that scoring these variants does not require downloading the feature file.

Don't forget to first activate the environment:

conda activate ncboost2

ncboost_test.ipynb should be run through a jupyter notebook environment, while the ncboost_test.py script should be run as follows:

python src/ncboost_test.py path/to/input/file.tsv

Example

python src/ncboost_test.py data/testing/testing_data.tsv

The output file will be created in the same folder as the input file, as a tab-delimited text file with the following columns:
the chromosome, position, reference and alternative allele of the variant
the name and Ensembl Gene ID of the nearest gene to which the variant was associated, and the corresponding non-coding region (upstream, downstream, UTR5, UTR3, intronic and intergenic)
the gene type and 11 gene-based features (slr_dnds, gene_age, pLI, zscore_mis, zscore_syn, loeuf, GDI, ncRVIS, ncGERP, RVIS_percentile, pcGERP), using a reference of 19433 protein-coding genes
6 one-hot encoded non-coding region types
18 features extracted from CADD annotation files [2] (GC, CpG, pri/mam/verPhCons, pri/mam/verPhyloP, GerpN, GerpS, GerpRS, GerpRSpval, ZooPriPhyloP, ZooVerPhyloP, bStatistic, ZooRoCC, ZooUCE, Roulette-AR)
the 9 MAF from gnomAD [3] (mean_MAF, mean_MAF_AFR/AMI/AMR/ASJ/EAS/FIN/MID/NFE/SAS)
the CDTS score
the max SpliceAI score at the variant and at the position level [4]
the substitution-encoding features
the NCBoost score
the extra columns provided by the user in the input file

The NCBoost score ranges from 0 to 1 (the higher the value, the higher the predicted pathogenicity).
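Since a higher score means higher predicted pathogenicity, a common next step is to sort the scored variants. The sketch below does this for a tab-delimited file with a header. The exact name of the score column in the output file is not given here, so it is passed as a parameter; "NCBoost" in the example is an assumed name, and treating empty or "NA" values as missing is also an assumption. The example rows are synthetic.

```python
import csv
import io


def rank_by_score(handle, score_column):
    """Return rows of a tab-delimited file sorted by descending score."""
    rows = list(csv.DictReader(handle, delimiter="\t"))
    # Drop rows without a usable score (missing-value encoding is assumed).
    scored = [r for r in rows if r.get(score_column) not in (None, "", "NA")]
    return sorted(scored, key=lambda r: float(r[score_column]), reverse=True)


# Synthetic output, for illustration only; the "NCBoost" column name is a
# placeholder to be replaced by the header found in the real output file.
example = io.StringIO(
    "chr\tpos\tref\talt\tNCBoost\n"
    "1\t12589\tG\tA\t0.08\n"
    "1\t12600\tC\tT\t0.73\n"
    "1\t12610\tT\tC\tNA\n"
)
ranked = rank_by_score(example, "NCBoost")
```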

More information can be found in the NCBoost 2 paper.

References

1: Wang and Hakonarson (2010). ANNOVAR: functional annotation of genetic variants from high-throughput sequencing data. Nucleic Acids Res.

2: Schubach et al. (2024). CADD v1.7: using protein language models, regulatory CNNs and other nucleotide-level scores to improve genome-wide variant predictions. Nucleic Acids Res.

3: Chen et al. (2024). A genomic mutational constraint map using variation in 76,156 human genomes. Nature.

4: Jaganathan et al. (2019). Predicting Splicing from Primary Sequence with Deep Learning. Cell.

Contact

Please address comments and questions about NCBoost to barthelemy.caron@institutimagine.org and antonio.rausell@institutimagine.org

License

NCBoost 2 scripts, framework and databases are available under the Apache License 2.0.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.
