Training a multi-class classifier for increasing the accuracy of dictionary-based Named Entity Recognition


The scripts have been modified from the NVIDIA codebase and adapted to run on the Finnish supercomputer Puhti.

Datasets

The consensus datasets used to train the model are available on Zenodo. The Zenodo record includes two datasets:

  1. A small dataset with ca. 125,000 training and 62,500 development examples, used to perform a grid search for the best set of hyperparameters (125k-w100_grid_search_set)
  2. A large dataset with 12.5 million training and 62,500 test examples, used to train the prediction model with the best hyperparameters identified above (12.5M-w100_train_test_set)

Minimal installation instructions on a Linux system

Clone the repository

git clone https://github.com/katnastou/BioBERT-based-entity-type-classifier.git
cd BioBERT-based-entity-type-classifier 

Download BioBERT base model

wget http://nlp.dmis.korea.edu/projects/biobert-2020-checkpoints/biobert_v1.1_pubmed.tar.gz
mkdir -p models
tar -xvzf biobert_v1.1_pubmed.tar.gz -C models
rm biobert_v1.1_pubmed.tar.gz

Download training data from Zenodo

wget https://zenodo.org/records/10008720/files/125k-w100_grid_search_set.tar.gz
mkdir -p data
tar -xvzf 125k-w100_grid_search_set.tar.gz -C data
rm 125k-w100_grid_search_set.tar.gz

Install conda before proceeding; installation instructions are available in the official conda documentation. If you are on a server with conda environments pre-installed, you can alternatively load one that supports at least Python 3.8. See the detailed requirements for the nvidia-tensorflow package on its PyPI page.

If you need to set up Python:

wget https://www.python.org/ftp/python/3.8.12/Python-3.8.12.tgz
tar -xzvf Python-3.8.12.tgz
cd Python-3.8.12
./configure --prefix=$HOME/python38
make
make install
export PATH=$HOME/python38/bin:$PATH
#verify installation
python --version

If Python 3.8 is already installed, skip directly to this step:

python3.8 -m venv venv
source venv/bin/activate
python3.8 -m pip install --upgrade pip
python3.8 -m pip install --upgrade setuptools
python3.8 -m pip install wheel
python3.8 -m pip cache purge
python3.8 -m pip install nvidia-pyindex==1.0.5 #or from file: https://files.pythonhosted.org/packages/64/4c/dd413559179536b9b7247f15bf968f7e52b5f8c1d2183ceb3d5ea9284776/nvidia-pyindex-1.0.5.tar.gz
python3.8 -m pip install nvidia-tensorflow[horovod]==1.15.5 #or from file: https://github.com/NVIDIA/tensorflow/archive/refs/tags/v1.15.5+nv23.03.tar.gz

Install openmpi and update your paths:

wget https://download.open-mpi.org/release/open-mpi/v4.0/openmpi-4.0.1.tar.gz
tar -xzf openmpi-4.0.1.tar.gz
rm openmpi-4.0.1.tar.gz 
cd openmpi-4.0.1
./configure --prefix=${HOME}/openmpi
make all
make install
cd ..
export PATH=${HOME}/openmpi/bin:$PATH
export LD_LIBRARY_PATH=${HOME}/openmpi/lib:$LD_LIBRARY_PATH

And you are good to go!

Test your installation by running this script:

./run-entity-classification.sh

Instructions for installing TensorFlow 1.15 with Horovod support locally can also be found here: https://www.pugetsystems.com/labs/hpc/how-to-install-tensorflow-1-15-for-nvidia-rtx30-gpus-without-docker-or-cuda-install-2005/

Steps to train/finetune the model on the Puhti supercomputer

Grid search to find a set of hyperparameters

Run the script ./run-finetuning-grid.sh, which invokes slurm/slurm-run-finetuning-grid.sh to run a grid search on the small dataset over the following hyperparameters:

models = BioBERT base
mini batch size = 32, 64
learning rate = 5e-5, 3e-5, 2e-5, 1e-5, 5e-6
number of epochs = 2, 3, 4
maximum sequence length = 256
repetitions = 3

Preliminary experiments showed that the longest sequence length that could fit in memory gave the best results, so we fixed the maximum sequence length at 256.
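The grid above amounts to 2 × 5 × 3 × 3 = 90 job submissions. As a rough sketch, it could be driven by a loop like the following (a hypothetical illustration only; the argument order passed to slurm/slurm-run-finetuning-grid.sh is an assumption, so check the actual script before adapting this):

```python
import itertools

# Hyperparameter grid from the README
BATCH_SIZES = [32, 64]
LEARNING_RATES = ["5e-5", "3e-5", "2e-5", "1e-5", "5e-6"]
EPOCHS = [2, 3, 4]
REPETITIONS = [1, 2, 3]

def grid_commands(model="models/biobert_v1.1_pubmed",
                  data="data/125k-w100_grid_search_set",
                  msl=256):
    """Yield one sbatch command per hyperparameter combination.

    The positional-argument order is an assumption for illustration;
    the real slurm script's interface may differ.
    """
    for batch, lr, epochs, rep in itertools.product(
            BATCH_SIZES, LEARNING_RATES, EPOCHS, REPETITIONS):
        yield ["sbatch", "slurm/slurm-run-finetuning-grid.sh",
               model, data, str(msl), str(batch), lr, str(epochs), str(rep)]

if __name__ == "__main__":
    for cmd in grid_commands():
        print(" ".join(cmd))  # replace print with subprocess.run to submit
```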

Link with results from grid search

The hyperparameters for the best-performing model on the development set are learning rate=2e-5, number_of_epochs=2, mini_batch_size=32, MSL=256, with a mean F-score=94.83% (SD=0.0356).

To get the stats for the finetuning grid, run: python3 get_stat.py <logs_dir> <output_filename> (e.g. python3 get_stat.py finetuning-grid-logs/ finetuning-grid-results.tsv)
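The aggregation that such a results file supports, i.e. computing the mean and standard deviation of the F-score across the three repetitions of each hyperparameter setting, can be sketched as follows. This is not get_stat.py itself; the column names (learning_rate, epochs, batch_size, f_score) and the helper name summarize are assumptions for illustration:

```python
import csv
import statistics
from collections import defaultdict

def summarize(tsv_path):
    """Group F-scores by hyperparameter setting and report (mean, SD).

    Assumes a TSV with a header row and columns learning_rate, epochs,
    batch_size, f_score; the real finetuning-grid-results.tsv may differ.
    """
    groups = defaultdict(list)
    with open(tsv_path, newline="") as f:
        for row in csv.DictReader(f, delimiter="\t"):
            key = (row["learning_rate"], row["epochs"], row["batch_size"])
            groups[key].append(float(row["f_score"]))
    return {
        key: (statistics.mean(scores),
              statistics.stdev(scores) if len(scores) > 1 else 0.0)
        for key, scores in groups.items()
    }
```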

Training the model with the large dataset

We have trained a model on the large dataset using the best set of hyperparameters. On a supercomputer with the Slurm workload manager, the training can be rerun with:

sbatch slurm/slurm-run-finetuning-big.sh models/biobert_v1.1_pubmed 12.5M-w100_train_test_set 256 32 2e-5 1 consensus models/biobert_v1.1_pubmed/model.ckpt-1000000 data/biobert/other

The results on the test set are: mean F-score=96.67% (SD=0). This model has been used to run predictions and generate blocklists for the STRING database v12, DISEASES, and ORGANISMS, as well as to update dictionary files for the JensenLab tagger.

Running in prediction mode for large-scale runs

The script that runs prediction mode for all entity types is:

run_predict_batch_auto_all_types.sh

Running this script invokes slurm/slurm-run-predict.sh to submit all jobs and generate predictions for all types, which are later used to generate probabilities.

Look at the README file and the setup scripts (setup.sh and setup-general.sh) within the generate_prediction_inputs directory for more details on how to run the entire pipeline from start to finish.

Technical considerations on Puhti

In order for this to work, one needs a working installation of TensorFlow 1.15 with Horovod support. TensorFlow 1.x support has been deprecated on Puhti, so an environment must be set up before running the scripts. To set this up on Puhti, follow the instructions below:

module purge
#https://docs.csc.fi/computing/containers/tykky/
module load tykky
mkdir conda-env
conda-containerize new --prefix conda-env env.yml
#replace current path with your full path
export PATH="/current_path/conda-env/bin:$PATH"

python3 -m venv venv
source venv/bin/activate
python -m pip install --upgrade pip
python -m pip install --upgrade setuptools
python -m pip install wheel
python -m pip install nvidia-pyindex==1.0.5
python -m pip install nvidia-tensorflow[horovod]==1.15.5

#set up openmpi version 4.0.1
wget https://download.open-mpi.org/release/open-mpi/v4.0/openmpi-4.0.1.tar.gz
tar -xzf openmpi-4.0.1.tar.gz
rm openmpi-4.0.1.tar.gz 
cd openmpi-4.0.1
./configure --prefix=${HOME}/openmpi
make all
make install

export PATH=${HOME}/openmpi/bin:$PATH
export LD_LIBRARY_PATH=${HOME}/openmpi/lib:$LD_LIBRARY_PATH

To train a model on the large dataset for span classification, execute:

sbatch slurm/slurm-run-finetuning-big.sh models/biobert_v1.1_pubmed 12.5M-w100_train_test_set 256 32 2e-5 1 consensus models/biobert_v1.1_pubmed/model.ckpt-1000000 data/biobert/other

The shell script calls run_ner_consensus.py with some default values; check the script for more details.

For instructions on how to run the model in prediction mode check setup.sh under the generate_prediction_inputs directory in this repository.
