Presenting our solution for DSA4262: m6atect by team Parkitect!
By Fang Yu, Eda, Kah Seng, Wen Yang
This repository is organised as follows:
```
root
├── .github          # GitHub configuration files (e.g., workflows for CI/CD)
├── data             # Input raw datasets (in JSON)
├── devo_notebooks   # Notebooks for development
├── model            # Stored trained models
├── output           # Results in CSV format from model predictions
├── scripts          # Main scripts for data processing and model training
└── tests            # Unit tests for scripts
```
> **Note**: For DSA4262 fellow peer reviewers

The main workflow consists of data processing, model training, and generating predictions. First, set up the environment:
1. Clone the repo

    ```shell
    git clone https://github.com/hoofangyu/dsa4262.git
    ```

2. Move into the `dsa4262` directory

    ```shell
    cd dsa4262
    ```

3. Install the required packages

    ```shell
    sudo apt install python3-pip
    python3 -m pip install -r requirements.txt
    ```

4. Grant permissions to execute the `run` script

    ```shell
    chmod 500 run
    ```

When using our pre-trained model, the workflow consists of only the data processing and prediction generation steps:
1. Use our sample testset in the `/data` folder, OR move/download a testset directly into the `/data` folder
2. Parse the testset with `parse_testset.py`

    ```shell
    python3 scripts/parse_testset.py <dataset_path> <output_file_name>
    ```

3. Run prediction with `catboost_predictions.py`

    ```shell
    python3 scripts/catboost_predictions.py <parsed_test_set_path> <model_path> <output_name> [--parquet]
    ```

    The `--parquet` flag is optional. Include it if you wish to save the output file in Parquet format instead of the default CSV.
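Once a predictions file has been generated, you may want to inspect it before further use. Below is a minimal sketch using pandas that loads either output format; the file path in the usage comment is only an example:

```python
from pathlib import Path

import pandas as pd


def load_predictions(path: str) -> pd.DataFrame:
    """Load a predictions file saved as CSV (the default) or Parquet (--parquet)."""
    if Path(path).suffix.lower() == ".parquet":
        return pd.read_parquet(path)  # requires a Parquet engine such as pyarrow
    return pd.read_csv(path)


# Example usage:
# df = load_predictions("output/dataset2_final_catboost_model_results.csv")
# print(df.head())
```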
> **Note**
> Alternatively, you may use our `run` script for convenience:
>
> 1. Move or download the testset directly into the `/data` folder
> 2. Parse the testset and run predictions
>
>    ```shell
>    ./run <test_set_path> <parse_test_set_name> <trained_model_path> <predictions_output_name> [is_parquet]
>    ```
>
>    The `[is_parquet]` option is optional. Set it to `true` if you wish to save the output file in Parquet format instead of the default CSV.
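The `run` script simply chains the parse and predict steps above. As a rough Python illustration only (this is not the repository's actual script; the `data/<name>.parquet` intermediate path is inferred from the worked examples in this README):

```python
import subprocess


def build_commands(test_set, parsed_name, model_path, output_name, is_parquet=False):
    """Build the two commands that the `run` script chains together."""
    parse_cmd = ["python3", "scripts/parse_testset.py", test_set, parsed_name]
    predict_cmd = [
        "python3",
        "scripts/catboost_predictions.py",
        f"data/{parsed_name}.parquet",  # parse_testset.py writes data/<name>.parquet
        model_path,
        output_name,
    ]
    if is_parquet:
        predict_cmd.append("--parquet")
    return parse_cmd, predict_cmd


def run_pipeline(test_set, parsed_name, model_path, output_name, is_parquet=False):
    """Run both steps in order, stopping if either step fails."""
    for cmd in build_commands(test_set, parsed_name, model_path, output_name, is_parquet):
        subprocess.run(cmd, check=True)
```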
> **Tip**
> We recommend using the largest instance, `m6i.4xlarge`, if possible, for faster computation.

> **Tip**
> To obtain the AWS `ip_address`, you can run this command within the console of the AWS Ubuntu instance:
>
> ```shell
> curl http://169.254.169.254/latest/meta-data/public-ipv4
> ```
With reference to the sample testset `data/dataset2.json.gz`:

1. Run the `run` shell script (on AWS)

    ```shell
    ./run data/dataset2.json.gz eval models/final_catboost_model.cbm dataset2_final_catboost_model_results
    ```

    OR run the individual Python scripts (on AWS):

    1. Parse the testset

        ```shell
        python3 scripts/parse_testset.py data/dataset2.json.gz eval
        ```

    2. Run prediction

        ```shell
        python3 scripts/catboost_predictions.py data/eval.parquet models/final_catboost_model.cbm dataset2_final_catboost_model_results
        ```

2. Download the predictions file from AWS to your local machine

    ```shell
    # scp -i <local_pem_file_path> <host_name@ip_address:path_to_predictions_file_on_aws> <local_destination_path>
    scp -i parkitect.pem ubuntu@11.111.111.111:dsa4262/output/dataset2_final_catboost_model_results.csv .
    ```

To use your own testset instead, upload it to the `/data` folder. On your local console, run the following:
```shell
# scp -i <local_pem_file_path> <local_testset_path> <host_name@ip_address:path_to_data_folder_in_dsa4262_folder_on_aws>
scp -i parkitect.pem data/dataset2.json.gz ubuntu@11.111.111.111:dsa4262/data
```

1. Run the `run` shell script (on AWS)

    ```shell
    ./run data/dataset2.json.gz eval models/final_catboost_model.cbm dataset2_final_catboost_model_results
    ```

    OR run the individual Python scripts (on AWS):

    1. Parse the testset

        ```shell
        python3 scripts/parse_testset.py data/dataset2.json.gz eval
        ```

    2. Run prediction

        ```shell
        python3 scripts/catboost_predictions.py data/eval.parquet models/final_catboost_model.cbm dataset2_final_catboost_model_results
        ```

2. Download the predictions file from AWS to your local machine

    ```shell
    # scp -i <local_pem_file_path> <host_name@ip_address:path_to_predictions_file_on_aws> <local_destination_path>
    scp -i parkitect.pem ubuntu@11.111.111.111:dsa4262/output/dataset2_final_catboost_model_results.csv .
    ```

Alternatively, move your local testset to the `/data` folder, then:
1. Run the `run` shell script

    ```shell
    ./run data/dataset1.json.gz eval models/final_catboost_model.cbm dataset1_final_catboost_model_results true
    ```

    OR run the individual Python scripts:

    1. Parse the testset

        ```shell
        python3 scripts/parse_testset.py data/dataset1.json.gz eval
        ```

    2. Run prediction

        ```shell
        python3 scripts/catboost_predictions.py data/eval.parquet models/final_catboost_model.cbm dataset1_final_catboost_model_results
        ```
1. Download the public testset to the `/data` folder

    ```shell
    aws s3 cp --no-sign-request s3://sg-nex-data/data/processed_data/m6Anet/SGNex_A549_directRNA_replicate5_run1/data.json data/
    ```

2. Run the `run` shell script

    ```shell
    ./run data/data.json eval models/final_catboost_model.cbm SGNex_A549_directRNA_replicate5_run1_final_catboost_model_results
    ```

    OR run the individual Python scripts:

    1. Parse the testset

        ```shell
        python3 scripts/parse_testset.py data/data.json eval
        ```

    2. Run prediction

        ```shell
        python3 scripts/catboost_predictions.py data/eval.parquet models/final_catboost_model.cbm SGNex_A549_directRNA_replicate5_run1_final_catboost_model_results
        ```

3. Download the predictions file from AWS to your local machine

    ```shell
    # scp -i <local_pem_file_path> <host_name@ip_address:path_to_predictions_file_on_aws> <local_destination_path>
    scp -i parkitect.pem ubuntu@11.111.111.111:dsa4262/output/SGNex_A549_directRNA_replicate5_run1_final_catboost_model_results.csv .
    ```

When generating your own model with our scripts, the workflow follows the main flow: data processing, then model training, and lastly prediction generation:
1. Move or download the training sets directly into the `/data` folder
2. Move or download the labels directly into the `/data` folder
3. Parse the training set together with its labels using `parse_json.py`

    ```shell
    python3 scripts/parse_json.py <training_set_path> <output_file_name>
    ```

4. Train a model on the parsed training set from Step 3 with `catboost_training.py`

    ```shell
    python3 scripts/catboost_training.py <parsed_training_set_path> <output_file_name>
    ```

5. Parse the test set with `parse_testset.py`

    ```shell
    python3 scripts/parse_testset.py <test_set_path> <output_file_name>
    ```

6. Run prediction using the parsed test set from Step 5 and the trained model from Step 4 with `catboost_predictions.py`

    ```shell
    python3 scripts/catboost_predictions.py <parsed_test_set_path> <model_path> <output_name> [--parquet]
    ```

    The `--parquet` flag is optional. Include it if you wish to save the output file in Parquet format instead of the default CSV.
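To make Steps 3 and 4 concrete, the sketch below illustrates the kind of join `parse_json.py` performs between parsed features and labels before training. All column names and values here are invented for illustration and are not the repository's actual schema:

```python
import pandas as pd

# Hypothetical parsed features (one row per candidate site)
features = pd.DataFrame({
    "transcript_id": ["t1", "t1", "t2"],
    "transcript_position": [100, 250, 80],
    "mean_signal": [0.91, 0.42, 0.67],
})

# Hypothetical labels keyed by the same identifiers
labels = pd.DataFrame({
    "transcript_id": ["t1", "t1", "t2"],
    "transcript_position": [100, 250, 80],
    "label": [1, 0, 0],
})

# Join features to labels so each row becomes a labelled training example;
# the result would then be written out, e.g. to data/training.parquet
training = features.merge(labels, on=["transcript_id", "transcript_position"], how="inner")
print(training)
```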
1. Move your local training and test sets to the `/data` folder

    ```shell
    # scp -i <local_pem_file_path> <local_testset_path> <host_name@ip_address:path_to_data_folder_in_dsa4262_folder_on_aws>
    scp -i parkitect.pem data/dataset1.json.gz ubuntu@11.111.111.111:dsa4262/data
    scp -i parkitect.pem data/dataset0.json.gz ubuntu@11.111.111.111:dsa4262/data
    ```

2. Parse the training set

    ```shell
    python3 scripts/parse_json.py data/dataset0.json.gz training
    ```

3. Train the model

    ```shell
    python3 scripts/catboost_training.py data/training.parquet cbmodel
    ```

4. Parse the test set

    ```shell
    python3 scripts/parse_testset.py data/dataset1.json.gz eval
    ```

5. Run prediction

    ```shell
    python3 scripts/catboost_predictions.py data/eval.parquet models/cb_model.cbm dataset1_final_cb_model_results
    ```

6. Copy the predictions file from AWS to your local machine

    ```shell
    # scp -i <local_pem_file_path> <host_name@ip_address:path_to_predictions_file_on_aws> <local_destination_path>
    scp -i parkitect.pem ubuntu@11.111.111.111:dsa4262/output/dataset1_final_cb_model_results.csv .
    ```

Alternatively, move your local training and test sets to the `/data` folder, then:

1. Parse the training set

    ```shell
    python3 scripts/parse_json.py data/dataset0.json.gz training
    ```

2. Train the model

    ```shell
    python3 scripts/catboost_training.py data/training.parquet cbmodel
    ```

3. Parse the test set

    ```shell
    python3 scripts/parse_testset.py data/dataset1.json.gz eval
    ```

4. Run prediction

    ```shell
    python3 scripts/catboost_predictions.py data/eval.parquet models/cb_model.cbm dataset1_final_cb_model_results
    ```
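Finally, a quick sanity check on any predictions file can catch mistakes (e.g. pointing a script at the wrong file) before results are shared. This is a hedged sketch: the column name `score` is an assumption, so substitute whatever column your predictions file actually uses:

```python
import pandas as pd


def check_scores(df: pd.DataFrame, score_col: str = "score") -> None:
    """Assert that predicted scores look like valid probabilities."""
    scores = df[score_col]
    assert scores.notna().all(), "predictions contain missing scores"
    assert scores.between(0.0, 1.0).all(), "scores must lie in [0, 1]"


# Example on a tiny hand-made frame (real usage would load an output file)
check_scores(pd.DataFrame({"score": [0.02, 0.97, 0.55]}))
```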
