Training a multi-class classifier for increasing the accuracy of dictionary-based Named Entity Recognition
Scripts have been modified from the NVIDIA codebase and adapted to run on the Finnish supercomputer Puhti.
The consensus datasets used to train the model are available here. Two datasets are included in the Zenodo project:
- A small dataset with ca. 125,000 training and 62,500 development examples, used to perform a grid search for the best set of hyperparameters (125k-w100_grid_search_set)
- A large dataset with 12.5 million training and 62,500 test examples, used to train the prediction model with the best hyperparameters identified above (12.5M-w100_train_test_set)
Clone the repository:

```shell
git clone https://github.com/katnastou/BioBERT-based-entity-type-classifier.git
cd BioBERT-based-entity-type-classifier
```
Download the BioBERT base model:

```shell
wget http://nlp.dmis.korea.edu/projects/biobert-2020-checkpoints/biobert_v1.1_pubmed.tar.gz
mkdir -p models
tar -xvzf biobert_v1.1_pubmed.tar.gz -C models
rm biobert_v1.1_pubmed.tar.gz
```
Download the training data from Zenodo:

```shell
wget https://zenodo.org/records/10008720/files/125k-w100_grid_search_set.tar.gz
mkdir -p data
tar -xvzf 125k-w100_grid_search_set.tar.gz -C data
rm 125k-w100_grid_search_set.tar.gz
```
Install conda before proceeding; instructions can be found here. If you are on a server with pre-installed conda environments, you can instead load one that supports at least Python 3.8. See the detailed requirements for the nvidia-tensorflow package here.
If you need to set up Python:

```shell
wget https://www.python.org/ftp/python/3.8.12/Python-3.8.12.tgz
tar -xzvf Python-3.8.12.tgz
cd Python-3.8.12
./configure --prefix=$HOME/python38
make
make install
export PATH=$HOME/python38/bin:$PATH
# verify the installation
python --version
```
If Python 3.8 is already available, skip directly to this step:
```shell
python3.8 -m venv venv
source venv/bin/activate
python3.8 -m pip install --upgrade pip
python3.8 -m pip install --upgrade setuptools
python3.8 -m pip install wheel
python3.8 -m pip cache purge
python3.8 -m pip install nvidia-pyindex==1.0.5 # or from file: https://files.pythonhosted.org/packages/64/4c/dd413559179536b9b7247f15bf968f7e52b5f8c1d2183ceb3d5ea9284776/nvidia-pyindex-1.0.5.tar.gz
python3.8 -m pip install nvidia-tensorflow[horovod]==1.15.5 # or from file: https://github.com/NVIDIA/tensorflow/archive/refs/tags/v1.15.5+nv23.03.tar.gz
```
Install Open MPI and update your paths:
```shell
wget https://download.open-mpi.org/release/open-mpi/v4.0/openmpi-4.0.1.tar.gz
tar -xzf openmpi-4.0.1.tar.gz
rm openmpi-4.0.1.tar.gz
cd openmpi-4.0.1
./configure --prefix=${HOME}/openmpi
make all
make install
cd ..
export PATH=${HOME}/openmpi/bin:$PATH
export LD_LIBRARY_PATH=${HOME}/openmpi/lib:$LD_LIBRARY_PATH
```
And you are good to go!
Test your installation by running this script:

```shell
./run-entity-classification.sh
```
Instructions to install TensorFlow 1.15 with Horovod support locally can also be found here: https://www.pugetsystems.com/labs/hpc/how-to-install-tensorflow-1-15-for-nvidia-rtx30-gpus-without-docker-or-cuda-install-2005/
Run the script ./run-finetuning-grid.sh, which invokes slurm/slurm-run-finetuning-grid.sh to run a grid search on the small dataset with the following hyperparameters:
- models = BioBERT base
- mini-batch size = 32, 64
- learning rate = 5e-5, 3e-5, 2e-5, 1e-5, 5e-6
- number of epochs = 2, 3, 4
- maximum sequence length = 256
- repetitions = 3
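The grid above amounts to nested loops that submit one Slurm job per configuration. Here is an illustrative sketch of that structure; the variable names and the commented-out sbatch line are assumptions, not taken from the actual run-finetuning-grid.sh:

```shell
# Illustrative sketch only; see run-finetuning-grid.sh for the real loop.
MODELS="models/biobert_v1.1_pubmed"
BATCH_SIZES="32 64"
LEARNING_RATES="5e-5 3e-5 2e-5 1e-5 5e-6"
EPOCHS="2 3 4"
MSL=256
REPETITIONS=3

count=0
for model in $MODELS; do
  for bs in $BATCH_SIZES; do
    for lr in $LEARNING_RATES; do
      for ep in $EPOCHS; do
        for rep in $(seq 1 $REPETITIONS); do
          # sbatch slurm/slurm-run-finetuning-grid.sh "$model" ... $MSL $bs $lr $ep
          count=$((count + 1))
        done
      done
    done
  done
done
echo "$count"   # 90 job submissions in total (1 x 2 x 5 x 3 x 3)
```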
Preliminary experiments showed that the maximum sequence length that could fit in memory gave the best results, so we went with that option.
Link with results from grid search
The hyperparameters for the best-performing model on the development set are learning rate=2e-5, number_of_epochs=2, mini_batch_size=32, MSL=256, with a mean F-score=94.83% (SD=0.0356).
To get the stats for the fine-tuning grid, run:

```shell
python3 get_stat.py <logs_dir> <output_filename>
# e.g. python3 get_stat.py finetuning-grid-logs/ finetuning-grid-results.tsv
```
We have trained a model with the large dataset using the best set of hyperparameters.
To rerun the training on a supercomputer with a Slurm workload manager:

```shell
sbatch slurm/slurm-run-finetuning-big.sh models/biobert_v1.1_pubmed 12.5M-w100_train_test_set 256 32 2e-5 1 consensus models/biobert_v1.1_pubmed/model.ckpt-1000000 data/biobert/other
```
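The positional arguments of the command above are not documented here; the breakdown below is our inferred reading based on the grid-search settings, so verify against slurm/slurm-run-finetuning-big.sh before relying on it:

```shell
# All argument meanings below are inferences, not documentation.
MODEL_DIR=models/biobert_v1.1_pubmed     # pre-trained model directory
DATASET=12.5M-w100_train_test_set        # train/test dataset directory
MAX_SEQ_LEN=256                          # maximum sequence length
BATCH_SIZE=32                            # mini-batch size
LEARNING_RATE=2e-5                       # learning rate
EXTRA=1                                  # unclear from the README; check the Slurm script
TASK=consensus                           # task/label-set name
INIT_CKPT=$MODEL_DIR/model.ckpt-1000000  # initial checkpoint
OUTPUT_DIR=data/biobert/other            # output/data directory

# Dry run: print the command instead of submitting it.
echo sbatch slurm/slurm-run-finetuning-big.sh \
  "$MODEL_DIR" "$DATASET" "$MAX_SEQ_LEN" "$BATCH_SIZE" "$LEARNING_RATE" \
  "$EXTRA" "$TASK" "$INIT_CKPT" "$OUTPUT_DIR"
```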
The results on the test set are: mean F-score=96.67% (SD=0).
This model has been used to run predictions and generate blocklists for STRING database v12, DISEASES, and ORGANISMS, as well as to update dictionary files for the JensenLab tagger.
The script to run in prediction mode for all types is:

```shell
./run_predict_batch_auto_all_types.sh
```
Running the bash script ./run_predict_batch_auto_che.sh invokes the Slurm script slurm/slurm-run-predict.sh to submit all jobs and generate predictions for all types, which are later used to generate probabilities.
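The per-type batch scripts presumably fan out one Slurm submission per entity type. A minimal sketch of that pattern follows; the type codes (other than che, which matches the script name above) and the argument order are assumptions, not taken from the repository scripts:

```shell
# Illustrative only: type codes and arguments are assumed, not from the repo.
TYPES="che dis ggp org"
njobs=0
for t in $TYPES; do
  # Dry run: print instead of calling sbatch.
  echo "sbatch slurm/slurm-run-predict.sh $t"
  njobs=$((njobs + 1))
done
echo "$njobs jobs"   # 4 jobs
```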
Look at the README file and the setup scripts (setup.sh and setup-general.sh) within the generate_prediction_inputs directory for more details on how to run the entire pipeline from start to finish.
For this to work, you need a working installation of TensorFlow 1.15 with Horovod support. TensorFlow 1.x support has been deprecated on Puhti, so you need to set up an environment before running the scripts. To set this up on Puhti, follow the instructions below:
```shell
module purge
# https://docs.csc.fi/computing/containers/tykky/
module load tykky
mkdir conda-env
conda-containerize new --prefix conda-env env.yml
# replace current path with your full path
export PATH="/current_path/conda-env/bin:$PATH"
python3 -m venv venv
source venv/bin/activate
python -m pip install --upgrade pip
python -m pip install --upgrade setuptools
python -m pip install wheel
python -m pip install nvidia-pyindex==1.0.5
python -m pip install nvidia-tensorflow[horovod]==1.15.5
# set up Open MPI version 4.0.1
wget https://download.open-mpi.org/release/open-mpi/v4.0/openmpi-4.0.1.tar.gz
tar -xzf openmpi-4.0.1.tar.gz
rm openmpi-4.0.1.tar.gz
cd openmpi-4.0.1
./configure --prefix=${HOME}/openmpi
make all
make install
export PATH=${HOME}/openmpi/bin:$PATH
export LD_LIBRARY_PATH=${HOME}/openmpi/lib:$LD_LIBRARY_PATH
```
To train a model with the large dataset for span classification, execute:
```shell
sbatch slurm/slurm-run-finetuning-big.sh models/biobert_v1.1_pubmed 12.5M-w100_train_test_set 256 32 2e-5 1 consensus models/biobert_v1.1_pubmed/model.ckpt-1000000 data/biobert/other
```
The shell script calls run_ner_consensus.py with some default values; check the script for more details. For instructions on how to run the model in prediction mode, check setup.sh under the generate_prediction_inputs directory in this repository.