- Install the required Python libraries with `poetry install`.
- Basic requirements:
  - python>=3.9.0
  - CUDA=11.8
  - torch=2.2.0
- If you want to run the preprocessing yourself, please also install the following tools:
  - cd-hit: https://github.com/weizhongli/cdhit
  - ViennaRNA: https://github.com/ViennaRNA/ViennaRNA
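
The requirements above can be checked before running anything heavier. The following is a minimal sketch, not part of this repository: the function name `check_environment` is hypothetical, and it assumes that cd-hit is installed as the `cd-hit` executable and that ViennaRNA provides `RNAfold` on the `PATH`.

```python
import shutil
import sys
from importlib import util


def check_environment():
    """Return a dict mapping each stated requirement to True, False, or None."""
    report = {"python>=3.9.0": sys.version_info >= (3, 9, 0)}

    # torch may not be installed yet; report None instead of failing.
    if util.find_spec("torch") is None:
        report["torch=2.2.0"] = None
        report["CUDA=11.8"] = None
    else:
        import torch

        report["torch=2.2.0"] = torch.__version__.startswith("2.2.0")
        # torch.version.cuda is None for CPU-only builds.
        report["CUDA=11.8"] = torch.version.cuda == "11.8"

    # Optional preprocessing tools (assumed executable names).
    for tool in ("cd-hit", "RNAfold"):
        report[f"{tool} on PATH"] = shutil.which(tool) is not None
    return report


if __name__ == "__main__":
    for name, ok in check_environment().items():
        print(f"{name}: {ok}")
```

A value of `None` means the check could not be performed, which is different from a failed check.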
Processed sequence embeddings and sequence CSV files can be downloaded from here.
- Download the GENCODE `Protein-coding transcript sequences` FASTA file from here.
- Create the sequence DataFrame from the raw GENCODE FASTA file by running `cd scripts` and then `sh create_seq_df.sh`.
- This generates the files `gencode_v44(vM33)_utr_gene_unique.csv` and `gencode_v44(vM33)_utr_gene_unique_5utr(3utr).fa`.
- If you want to remove similar sequences, run `scripts/cd_hit.sh` on those FASTA files.
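
To illustrate what building a sequence table from the GENCODE FASTA file involves, here is a minimal sketch. It is not the code in `create_seq_df.sh`; the function names are hypothetical. It assumes the usual GENCODE protein-coding transcript header layout, in which `|`-separated fields include `UTR5:start-end`, `CDS:start-end`, and `UTR3:start-end` with 1-based inclusive coordinates. Any filtering the repository applies (for example, keeping one transcript per gene) is not reproduced here.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def split_regions(header, sequence):
    """Return transcript/gene IDs plus the 5'UTR, CDS, and 3'UTR subsequences."""
    fields = header.lstrip(">").strip().strip("|").split("|")
    record = {"transcript_id": fields[0], "gene_id": fields[1]}
    spans = {}
    for field in fields:
        # Region fields look like "UTR5:1-60" (assumed header layout).
        if field.startswith(("UTR5:", "CDS:", "UTR3:")):
            name, span = field.split(":")
            start, end = span.split("-")
            spans[name] = (int(start), int(end))
    for name in ("UTR5", "CDS", "UTR3"):
        if name in spans:
            start, end = spans[name]
            # Convert 1-based inclusive coordinates to a Python slice.
            record[name.lower()] = sequence[start - 1:end]
        else:
            record[name.lower()] = ""  # region absent from the header
    return record


if __name__ == "__main__":
    demo = [
        ">TX1.1|GENE1.1|-|-|G-201|G|12|UTR5:1-3|CDS:4-9|UTR3:10-12|",
        "GGGATG",
        "TAACCC",
    ]
    for hdr, seq in read_fasta(demo):
        print(split_regions(hdr, seq))
```

The resulting dictionaries can be passed to `pandas.DataFrame` to produce a CSV similar in spirit to the one described above.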
- Get the sequence embeddings (model inputs):
  - With RNA-FM: `sh get_emb_rnafm.sh`
  - With RiNALMo: `sh get_emb_rinalmo.sh`
  - For random forest features: `sh get_rf_feature.sh`
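
If you prefer to drive the feature-extraction step from Python, the three scripts above can be wrapped in a small dispatcher. This is a sketch only: the keys `rnafm`, `rinalmo`, and `rf`, the function names, and the assumption that the scripts are run from the `scripts` directory are all illustrative and not confirmed by this repository.

```python
import subprocess

# Hypothetical short names mapped to the script names listed in this README.
FEATURE_SCRIPTS = {
    "rnafm": "get_emb_rnafm.sh",
    "rinalmo": "get_emb_rinalmo.sh",
    "rf": "get_rf_feature.sh",
}


def build_command(method):
    """Return the shell command (as an argument list) for one feature type."""
    if method not in FEATURE_SCRIPTS:
        known = ", ".join(sorted(FEATURE_SCRIPTS))
        raise ValueError(f"unknown method {method!r}; expected one of: {known}")
    return ["sh", FEATURE_SCRIPTS[method]]


def run_feature_extraction(method, scripts_dir="scripts"):
    """Run the chosen script; scripts_dir is an assumed location."""
    subprocess.run(build_command(method), cwd=scripts_dir, check=True)
```

For example, `run_feature_extraction("rnafm")` would be equivalent to running `sh get_emb_rnafm.sh` inside `scripts`, under the assumptions above.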
- Use `src/run_train_XX.py` for training, replacing `XX` with one of the learning-method abbreviations in the table below.
- Config files follow the naming rule `config/<SPECIES>_<LEARNING_METHOD>.yaml`.
| Abbreviation | Learning method |
|---|---|
| cl | contrastive learning |
| sv | supervised learning |
| rf | random forest |
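
The naming rules above can be expressed as a small helper that resolves the training script and config file for a given species and learning method. This is a sketch: the function names are hypothetical, `human` is the only species name confirmed by this README, and the command assumes it is run from the `src` directory with `--cfg` as the config flag, as in the README's run command.

```python
from pathlib import PurePosixPath

# Abbreviations taken from the table in this README.
LEARNING_METHODS = {
    "cl": "contrastive learning",
    "sv": "supervised learning",
    "rf": "random forest",
}


def training_paths(species, method):
    """Return (script, config) paths that follow the README naming rules."""
    if method not in LEARNING_METHODS:
        known = ", ".join(sorted(LEARNING_METHODS))
        raise ValueError(f"unknown method {method!r}; expected one of: {known}")
    script = PurePosixPath("src") / f"run_train_{method}.py"
    config = PurePosixPath("config") / f"{species}_{method}.yaml"
    return script, config


def training_command(species, method):
    """Build the command line, assuming it is executed from src/."""
    script, config = training_paths(species, method)
    return ["poetry", "run", "python", script.name, "--cfg", f"../{config}"]


if __name__ == "__main__":
    print(" ".join(training_command("human", "cl")))
```

Validating the abbreviation up front gives a clearer error than a missing-file failure later on.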
- Run example (from the `src` directory): `poetry run python run_train_cl.py --cfg ../config/human_cl.yaml`
- `crossval_analysis.ipynb`: Performs cross-validation analysis to evaluate the consistency of results across experiments. Visualizes the distribution of cosine similarities and the correlations between different experiments.
- `sequential_analysis.ipynb`: Analyzes basic sequence features (e.g., the lengths of the 5'UTR, 3'UTR, and CDS, and the minimum free energy, MFE).
- `expression_analysis.ipynb`: Analyzes translation efficiency (TE) using RNA-seq and Ribo-seq data for each cell line.
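
Two of the quantities used in these notebooks can be sketched in a few lines of plain Python. This is illustrative only and not taken from the notebooks: the function names are hypothetical, TE is assumed to be the log2 ratio of Ribo-seq to RNA-seq abundance (a common definition; the notebooks may use a different one), and the `pseudocount` parameter is an assumption added to avoid division by zero.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length numeric vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm_u * norm_v)


def translation_efficiency(ribo_abundance, rna_abundance, pseudocount=1.0):
    """Assumed TE definition: log2((Ribo-seq + c) / (RNA-seq + c))."""
    return math.log2((ribo_abundance + pseudocount) / (rna_abundance + pseudocount))
```

For example, comparing the embedding of the same transcript from two cross-validation runs with `cosine_similarity` gives a value near 1 when the runs agree.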
Citation information will be added here.