This repository contains the scripts and datasets used for the development of the ExTRI2 pipeline.
Create the environment to run the scripts. We used Python 3.12:
python -m venv .extri2_venv
source .extri2_venv/bin/activate
pip install -r requirements.txt
python -m ipykernel install --user --name=extri2_venv
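Since the repo was developed with Python 3.12, a quick sanity check like the sketch below (not part of the repo; `check_python_version` is a hypothetical helper) can catch version mismatches before running the notebooks:

```python
import sys

def check_python_version(expected=(3, 12)):
    """Return True if the active interpreter matches the expected
    (major, minor) version the pipeline was developed with."""
    return sys.version_info[:2] == expected

# The scripts were developed with Python 3.12 (see above); other
# versions may still work but were not tested.
if not check_python_version():
    print("Warning: this environment is not Python 3.12")
```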
- TODO: Explain how to set up RBBT
- `workflow.rb`: Main script to obtain TRI sentences from a folder of PubTator files.
- `classifiers_training/`: Standalone folder used to obtain the TRI and MoR classifiers used in `workflow.rb`. See the `README` inside the folder for a more detailed explanation.
- `scripts/`: All other scripts to aid the main `workflow.rb`, including:
  - `preprocessing/`: to prepare the input for the main script,
  - `classifiers_training/`: to prepare the data to train the classifiers,
  - `postprocessing/`: to convert the output into the final ExTRI2 dataset,
  - `validation/`: to prepare the validation sets to manually validate.
- `data/`: All raw and intermediate files required to run the workflow. See more information in the `README` inside the folder.
- `results/`: Contains the raw and final ExTRI2 resource, and the validated sentences.
- `analysis/`: Contains all analyses of the ExTRI2 dataset used for the ExTRI2 paper.
The final ExTRI2 dataset required training the classifiers and improving the training dataset, preparing files for the workflow.rb, postprocessing the resulting files, and preparing sentences for validation. This was achieved by running the following scripts:
- Classifiers training:
  - Classifiers were trained inside the folder `classifiers_training/` (see the `README` there).
  - The best performing model was chosen with `analysis/classifiers_comparison.ipynb`.
  - Classifier outputs were used to retroactively detect sentences to revise and improve the training dataset. Inside `scripts/classifiers_training/`:
    - `prepare_reannotation_Excels.ipynb` prepares the sentences to revise.
    - `update_tri_sentences.ipynb` and `make_train_data.ipynb` update the datasets and prepare the files used for training the models.
- Preprocessing:
  - `preprocessing/get_NCBI_TF_IDs.ipynb` obtains the `tf_entrez_code.list`, which determines which Gene IDs are considered TFs.
  - `preprocessing/prepare_pubtator_for_ExTRI2.ipynb`
Obtaining the files and models required to run the main ExTRI2 workflow:
- `data/tf_entrez_code.list`: a list of all dbTFs & coTFs. Obtained by running `scripts/preprocessing/get_NCBI_TF_IDs.ipynb`.
- `data/pubtator/`: all PubMed abstracts containing TFs from the above list, in PubTator format. To obtain, run:

      cd scripts/preprocessing/
      ./get_all_pubtators.sh
      python prepare_pubtator_for_ExTRI2.py

- `TRI_classifier` and `MoR_classifier`: models to classify the sentences.
- TODO: Specify where to find these classifiers
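For readers unfamiliar with the PubTator format mentioned above, here is an illustrative sketch of how one such document is laid out (this is not the repo's actual parser, and the PMID and annotation shown are made up):

```python
def parse_pubtator(lines):
    """Split one PubTator-format document into title, abstract and annotations.

    PubTator files hold 'PMID|t|title' and 'PMID|a|abstract' lines followed
    by tab-separated annotation lines: PMID, start, end, mention, type, ID.
    """
    title, abstract, annotations = "", "", []
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # blank lines separate documents
        if "|t|" in line:
            title = line.split("|t|", 1)[1]
        elif "|a|" in line:
            abstract = line.split("|a|", 1)[1]
        else:
            fields = line.split("\t")
            if len(fields) >= 6:  # PMID, start, end, mention, type, ID
                annotations.append({
                    "pmid": fields[0], "start": int(fields[1]),
                    "end": int(fields[2]), "text": fields[3],
                    "type": fields[4], "id": fields[5],
                })
    return title, abstract, annotations

# Toy document with a made-up PMID and gene annotation
doc = [
    "123|t|FOXP3 regulates IL2 expression.",
    "123|a|We show that FOXP3 represses IL2.",
    "123\t0\t5\tFOXP3\tGene\t50943",
]
title, abstract, anns = parse_pubtator(doc)
print(title)                             # FOXP3 regulates IL2 expression.
print(anns[0]["text"], anns[0]["type"])  # FOXP3 Gene
```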
Run `workflow.rb` to get all candidate sentences containing a TRI (Transcription Regulation Interaction) along with their MoR (Mode of Regulation). To run, use:
rbbt workflow.rb
# TODO - Miguel - What was the code exactly?
Check `scripts/postprocessing/prepare_ExTRI2_resource.ipynb` for an explanation of how the final ExTRI2 resource was created.
`analysis/repo_to_paper.ipynb` contains all analyses and figures created for the paper, as well as links to the scripts used, organized by the paper's sections.
How to set up the `.general_env` (used to run all scripts except classifiers training):
python3 -m venv .general_env
source .general_env/bin/activate
pip install ipykernel
python3 -m ipykernel install --user --name .general_env
pip install pandas matplotlib torch biopython
- TODO: Ensure the list above is complete & explain the `classifiers_training` env too
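To confirm the environment actually has the packages listed above, a minimal check like this can be run inside the activated venv (`missing_packages` is a hypothetical helper, not part of the repo; note that biopython is imported as `Bio`):

```python
import importlib.util

def missing_packages(names):
    """Return the import names that cannot be found in the active environment."""
    return [n for n in names if importlib.util.find_spec(n) is None]

# Import names for the packages installed above (biopython -> Bio)
required = ["pandas", "matplotlib", "torch", "Bio"]
print("Missing:", missing_packages(required))
```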