This repository contains the scripts and datasets used for the development of the ExTRI2 pipeline.
Create the environment to run the scripts. We used Python 3.12:
python -m venv .extri2_venv
source .extri2_venv/bin/activate
pip install -r requirements.txt
python -m ipykernel install --user --name=extri2_venv
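Since the repo was developed with Python 3.12, a quick sanity check like the sketch below (not part of the repo; `check_python_version` is a hypothetical helper) can catch version mismatches before running the notebooks:

```python
import sys

def check_python_version(expected=(3, 12)):
    """Return True if the active interpreter matches the expected
    (major, minor) version the pipeline was developed with."""
    return sys.version_info[:2] == expected

# The scripts were developed with Python 3.12 (see above); other
# versions may still work but were not tested.
if not check_python_version():
    print("Warning: this environment is not Python 3.12")
```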
- TODO: Explain how to set up RBBT
- `workflow.rb`: Main script to obtain TRI sentences from a folder of PubTator files.
- `classifiers_training/`: Standalone folder used to obtain the TRI and MoR classifiers used in `workflow.rb`. See the `README` inside the folder for a more detailed explanation.
- `scripts/`: All other scripts to aid the main `workflow.rb`, including:
  - `preprocessing/`: to prepare the input for the main script,
  - `classifiers_training/`: to prepare the data to train the classifiers,
  - `postprocessing/`: to convert the output into the final ExTRI2 dataset,
  - `validation/`: to prepare the validation sets to manually validate.
- `data/`: All raw and intermediate files required to run the workflow. See more information in the `README` inside the folder.
- `results/`: Contains the raw and final ExTRI2 resource, and the validated sentences.
- `analysis/`: Contains all analyses of the ExTRI2 dataset used for the ExTRI2 paper.
The final ExTRI2 dataset required training the classifiers and improving the training dataset, preparing files for the workflow.rb, postprocessing the resulting files, and preparing sentences for validation. This was achieved by running the following scripts:
- Classifiers training:
  - Classifiers were trained inside the folder `classifiers_training/` (see the `README` there).
  - The best performing model was chosen with `analysis/classifiers_comparison.ipynb`.
  - Classifier outputs were used to retroactively detect sentences to revise and improve the training dataset. Inside `scripts/classifiers_training/`:
    - `prepare_reannotation_Excels.ipynb` prepares the sentences to revise.
    - `update_tri_sentences.ipynb` and `make_train_data.ipynb` update the datasets and prepare the files used for training the models.
- Preprocessing:
  - `preprocessing/get_NCBI_TF_IDs.ipynb` obtains the `tf_entrez_code.list`, which determines which Gene IDs are considered TFs.
  - `preprocessing/prepare_pubtator_for_ExTRI2.ipynb`
Obtaining the files and models required to run the main ExTRI2 workflow:
- `data/tf_entrez_code.list`: a list of all dbTFs & coTFs. Obtained by running `scripts/preprocessing/get_NCBI_TF_IDs.ipynb`.
- `data/pubtator/`: all PubMed abstracts containing TFs from the above list, in PubTator format. To obtain, run:

      cd scripts/preprocessing/
      ./get_all_pubtators.sh
      python prepare_pubtator_for_ExTRI2.py

- `TRI_classifier` and `MoR_classifier`: models to classify the sentences.
- TODO: Specify where to find these classifiers
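For readers unfamiliar with the PubTator format mentioned above, here is an illustrative sketch of how one such document is laid out (this is not the repo's actual parser, and the PMID and annotation shown are made up):

```python
def parse_pubtator(lines):
    """Split one PubTator-format document into title, abstract and annotations.

    PubTator files hold 'PMID|t|title' and 'PMID|a|abstract' lines followed
    by tab-separated annotation lines: PMID, start, end, mention, type, ID.
    """
    title, abstract, annotations = "", "", []
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # blank lines separate documents
        if "|t|" in line:
            title = line.split("|t|", 1)[1]
        elif "|a|" in line:
            abstract = line.split("|a|", 1)[1]
        else:
            fields = line.split("\t")
            if len(fields) >= 6:  # PMID, start, end, mention, type, ID
                annotations.append({
                    "pmid": fields[0], "start": int(fields[1]),
                    "end": int(fields[2]), "text": fields[3],
                    "type": fields[4], "id": fields[5],
                })
    return title, abstract, annotations

# Toy document with a made-up PMID and gene annotation
doc = [
    "123|t|FOXP3 regulates IL2 expression.",
    "123|a|We show that FOXP3 represses IL2.",
    "123\t0\t5\tFOXP3\tGene\t50943",
]
title, abstract, anns = parse_pubtator(doc)
print(title)                             # FOXP3 regulates IL2 expression.
print(anns[0]["text"], anns[0]["type"])  # FOXP3 Gene
```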
Run `workflow.rb` to get all candidate sentences containing a TRI (Transcription Regulation Interaction) along with their MoR (Mode of Regulation). To run, use:
rbbt workflow.rb
# TODO - Miguel - What was the code exactly?
Check `scripts/postprocessing/prepare_ExTRI2_resource.ipynb` for an explanation of how the final ExTRI2 resource was created.
`analysis/repo_to_paper.ipynb` contains all analyses and figures created for the paper, as well as links to the scripts used, organized by the paper's sections.
How to set up the `.general_env` (used to run all scripts except classifiers training):
python3 -m venv .general_env
source .general_env/bin/activate
pip install ipykernel
python3 -m ipykernel install --user --name .general_env
pip install pandas matplotlib torch biopython
- TODO: Ensure the list above is complete & explain the `classifiers_training` env too
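To confirm the environment actually has the packages listed above, a minimal check like this can be run inside the activated venv (`missing_packages` is a hypothetical helper, not part of the repo; note that biopython is imported as `Bio`):

```python
import importlib.util

def missing_packages(names):
    """Return the import names that cannot be found in the active environment."""
    return [n for n in names if importlib.util.find_spec(n) is None]

# Import names for the packages installed above (biopython -> Bio)
required = ["pandas", "matplotlib", "torch", "Bio"]
print("Missing:", missing_packages(required))
```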