Presenting our solution for DSA4262: m6atect by team Parkitect!
By Fang Yu, Eda, Kah Seng, Wen Yang
This repository is organised as follows:
```
root
├── .github          # GitHub configuration files (e.g., workflows for CI/CD)
├── data             # Input raw datasets (in JSON)
├── devo_notebooks   # Notebooks for development
├── model            # Stored trained models
├── output           # Results in CSV format from model predictions
├── scripts          # Main scripts for data processing and model training
└── tests            # Unit tests for scripts
```
> **Note**: For DSA4262 fellow peer reviewers

The main workflow consists of data processing, model training, and generating predictions. First, set up the environment:
1. Clone the repo

    ```shell
    git clone https://github.com/hoofangyu/dsa4262.git
    ```

2. Move into the `dsa4262` directory

    ```shell
    cd dsa4262
    ```

3. Install the required packages

    ```shell
    sudo apt install python3-pip
    python3 -m pip install -r requirements.txt
    ```

4. Grant permissions to execute the `run` script

    ```shell
    chmod 500 run
    ```

When using our pre-trained model, the workflow consists of only the data processing and prediction generation steps:
1. Use our sample testset in the `/data` folder, OR move/download a testset directly into the `/data` folder
2. Parse the testset with `parse_testset.py`

    ```shell
    python3 scripts/parse_testset.py <dataset_path> <output_file_name>
    ```

3. Run prediction with `catboost_predictions.py`

    ```shell
    python3 scripts/catboost_predictions.py <parsed_test_set_path> <model_path> <output_name> [--parquet]
    ```

    The `--parquet` flag is optional. Include it if you wish to save the output file in Parquet format instead of the default CSV.
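Once a predictions file has been generated, you may want to inspect it before further use. Below is a minimal sketch using pandas that loads either output format; the file path in the usage comment is only an example:

```python
from pathlib import Path

import pandas as pd


def load_predictions(path: str) -> pd.DataFrame:
    """Load a predictions file saved as CSV (the default) or Parquet (--parquet)."""
    if Path(path).suffix.lower() == ".parquet":
        return pd.read_parquet(path)  # requires a Parquet engine such as pyarrow
    return pd.read_csv(path)


# Example usage:
# df = load_predictions("output/dataset2_final_catboost_model_results.csv")
# print(df.head())
```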
> **Note**
> Alternatively, you may use our `run` script for convenience:
>
> 1. Move or download the testset directly into the `/data` folder
> 2. Parse the testset and run predictions
>
>    ```shell
>    ./run <test_set_path> <parse_test_set_name> <trained_model_path> <predictions_output_name> [is_parquet]
>    ```
>
>    The `[is_parquet]` option is optional. Set it to `true` if you wish to save the output file in Parquet format instead of the default CSV.
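The `run` script simply chains the parse and predict steps above. As a rough Python illustration only (this is not the repository's actual script; the `data/<name>.parquet` intermediate path is inferred from the worked examples in this README):

```python
import subprocess


def build_commands(test_set, parsed_name, model_path, output_name, is_parquet=False):
    """Build the two commands that the `run` script chains together."""
    parse_cmd = ["python3", "scripts/parse_testset.py", test_set, parsed_name]
    predict_cmd = [
        "python3",
        "scripts/catboost_predictions.py",
        f"data/{parsed_name}.parquet",  # parse_testset.py writes data/<name>.parquet
        model_path,
        output_name,
    ]
    if is_parquet:
        predict_cmd.append("--parquet")
    return parse_cmd, predict_cmd


def run_pipeline(test_set, parsed_name, model_path, output_name, is_parquet=False):
    """Run both steps in order, stopping if either step fails."""
    for cmd in build_commands(test_set, parsed_name, model_path, output_name, is_parquet):
        subprocess.run(cmd, check=True)
```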
> **Tip**
> We recommend using the largest instance, `m6i.4xlarge`, if possible, for faster computation.

> **Tip**
> To obtain the AWS `ip_address`, you can run this command within the console of the AWS Ubuntu instance:
>
> ```shell
> curl http://169.254.169.254/latest/meta-data/public-ipv4
> ```
With reference to the sample testset `data/dataset2.json.gz`:

1. Run the `run` shell script (on AWS)

    ```shell
    ./run data/dataset2.json.gz eval models/final_catboost_model.cbm dataset2_final_catboost_model_results
    ```

    OR run the individual Python scripts (on AWS):

    1. Parse the testset

        ```shell
        python3 scripts/parse_testset.py data/dataset2.json.gz eval
        ```

    2. Run prediction

        ```shell
        python3 scripts/catboost_predictions.py data/eval.parquet models/final_catboost_model.cbm dataset2_final_catboost_model_results
        ```

2. Download the predictions file from AWS to your local machine

    ```shell
    # scp -i <local_pem_file_path> <host_name@ip_address:path_to_predictions_file_on_aws> <local_destination_path>
    scp -i parkitect.pem ubuntu@11.111.111.111:dsa4262/output/dataset2_final_catboost_model_results.csv .
    ```

To use your own testset instead, upload it to the `/data` folder. On your local console, run the following:
```shell
# scp -i <local_pem_file_path> <local_testset_path> <host_name@ip_address:path_to_data_folder_in_dsa4262_folder_on_aws>
scp -i parkitect.pem data/dataset2.json.gz ubuntu@11.111.111.111:dsa4262/data
```

1. Run the `run` shell script (on AWS)

    ```shell
    ./run data/dataset2.json.gz eval models/final_catboost_model.cbm dataset2_final_catboost_model_results
    ```

    OR run the individual Python scripts (on AWS):

    1. Parse the testset

        ```shell
        python3 scripts/parse_testset.py data/dataset2.json.gz eval
        ```

    2. Run prediction

        ```shell
        python3 scripts/catboost_predictions.py data/eval.parquet models/final_catboost_model.cbm dataset2_final_catboost_model_results
        ```

2. Download the predictions file from AWS to your local machine

    ```shell
    # scp -i <local_pem_file_path> <host_name@ip_address:path_to_predictions_file_on_aws> <local_destination_path>
    scp -i parkitect.pem ubuntu@11.111.111.111:dsa4262/output/dataset2_final_catboost_model_results.csv .
    ```

Alternatively, move your local testset to the `/data` folder, then:
1. Run the `run` shell script

    ```shell
    ./run data/dataset1.json.gz eval models/final_catboost_model.cbm dataset1_final_catboost_model_results true
    ```

    OR run the individual Python scripts:

    1. Parse the testset

        ```shell
        python3 scripts/parse_testset.py data/dataset1.json.gz eval
        ```

    2. Run prediction

        ```shell
        python3 scripts/catboost_predictions.py data/eval.parquet models/final_catboost_model.cbm dataset1_final_catboost_model_results
        ```
1. Download the public testset to the `/data` folder

    ```shell
    aws s3 cp --no-sign-request s3://sg-nex-data/data/processed_data/m6Anet/SGNex_A549_directRNA_replicate5_run1/data.json data/
    ```

2. Run the `run` shell script

    ```shell
    ./run data/data.json eval models/final_catboost_model.cbm SGNex_A549_directRNA_replicate5_run1_final_catboost_model_results
    ```

    OR run the individual Python scripts:

    1. Parse the testset

        ```shell
        python3 scripts/parse_testset.py data/data.json eval
        ```

    2. Run prediction

        ```shell
        python3 scripts/catboost_predictions.py data/eval.parquet models/final_catboost_model.cbm SGNex_A549_directRNA_replicate5_run1_final_catboost_model_results
        ```

3. Download the predictions file from AWS to your local machine

    ```shell
    # scp -i <local_pem_file_path> <host_name@ip_address:path_to_predictions_file_on_aws> <local_destination_path>
    scp -i parkitect.pem ubuntu@11.111.111.111:dsa4262/output/SGNex_A549_directRNA_replicate5_run1_final_catboost_model_results.csv .
    ```

When generating your own model with our scripts, the workflow follows the main flow: data processing, then model training, and lastly prediction generation:
1. Move or download the training sets directly into the `/data` folder
2. Move or download the labels directly into the `/data` folder
3. Parse the training set together with its labels using `parse_json.py`

    ```shell
    python3 scripts/parse_json.py <training_set_path> <output_file_name>
    ```

4. Train a model on the parsed training set from Step 3 with `catboost_training.py`

    ```shell
    python3 scripts/catboost_training.py <parsed_training_set_path> <output_file_name>
    ```

5. Parse the test set with `parse_testset.py`

    ```shell
    python3 scripts/parse_testset.py <test_set_path> <output_file_name>
    ```

6. Run prediction using the parsed test set from Step 5 and the trained model from Step 4 with `catboost_predictions.py`

    ```shell
    python3 scripts/catboost_predictions.py <parsed_test_set_path> <model_path> <output_name> [--parquet]
    ```

    The `--parquet` flag is optional. Include it if you wish to save the output file in Parquet format instead of the default CSV.
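To make Steps 3 and 4 concrete, the sketch below illustrates the kind of join `parse_json.py` performs between parsed features and labels before training. All column names and values here are invented for illustration and are not the repository's actual schema:

```python
import pandas as pd

# Hypothetical parsed features (one row per candidate site)
features = pd.DataFrame({
    "transcript_id": ["t1", "t1", "t2"],
    "transcript_position": [100, 250, 80],
    "mean_signal": [0.91, 0.42, 0.67],
})

# Hypothetical labels keyed by the same identifiers
labels = pd.DataFrame({
    "transcript_id": ["t1", "t1", "t2"],
    "transcript_position": [100, 250, 80],
    "label": [1, 0, 0],
})

# Join features to labels so each row becomes a labelled training example;
# the result would then be written out, e.g. to data/training.parquet
training = features.merge(labels, on=["transcript_id", "transcript_position"], how="inner")
print(training)
```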
1. Move your local training and test sets to the `/data` folder

    ```shell
    # scp -i <local_pem_file_path> <local_testset_path> <host_name@ip_address:path_to_data_folder_in_dsa4262_folder_on_aws>
    scp -i parkitect.pem data/dataset1.json.gz ubuntu@11.111.111.111:dsa4262/data
    scp -i parkitect.pem data/dataset0.json.gz ubuntu@11.111.111.111:dsa4262/data
    ```

2. Parse the training set

    ```shell
    python3 scripts/parse_json.py data/dataset0.json.gz training
    ```

3. Train the model

    ```shell
    python3 scripts/catboost_training.py data/training.parquet cbmodel
    ```

4. Parse the test set

    ```shell
    python3 scripts/parse_testset.py data/dataset1.json.gz eval
    ```

5. Run prediction

    ```shell
    python3 scripts/catboost_predictions.py data/eval.parquet models/cb_model.cbm dataset1_final_cb_model_results
    ```

6. Copy the predictions file from AWS to your local machine

    ```shell
    # scp -i <local_pem_file_path> <host_name@ip_address:path_to_predictions_file_on_aws> <local_destination_path>
    scp -i parkitect.pem ubuntu@11.111.111.111:dsa4262/output/dataset1_final_cb_model_results.csv .
    ```

Alternatively, move your local training and test sets to the `/data` folder, then:

1. Parse the training set

    ```shell
    python3 scripts/parse_json.py data/dataset0.json.gz training
    ```

2. Train the model

    ```shell
    python3 scripts/catboost_training.py data/training.parquet cbmodel
    ```

3. Parse the test set

    ```shell
    python3 scripts/parse_testset.py data/dataset1.json.gz eval
    ```

4. Run prediction

    ```shell
    python3 scripts/catboost_predictions.py data/eval.parquet models/cb_model.cbm dataset1_final_cb_model_results
    ```
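Finally, a quick sanity check on any predictions file can catch mistakes (e.g. pointing a script at the wrong file) before results are shared. This is a hedged sketch: the column name `score` is an assumption, so substitute whatever column your predictions file actually uses:

```python
import pandas as pd


def check_scores(df: pd.DataFrame, score_col: str = "score") -> None:
    """Assert that predicted scores look like valid probabilities."""
    scores = df[score_col]
    assert scores.notna().all(), "predictions contain missing scores"
    assert scores.between(0.0, 1.0).all(), "scores must lie in [0, 1]"


# Example on a tiny hand-made frame (real usage would load an output file)
check_scores(pd.DataFrame({"score": [0.02, 0.97, 0.55]}))
```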
