# m6Atect

Presenting our solution for DSA4262: m6Atect, by team Parkitect!
By Fang Yu, Eda, Kah Seng, and Wen Yang

## Getting Started

This repository is organised as follows:

```text
root
├── .github        # GitHub configuration files (e.g., workflows for CI/CD)
├── data           # Input raw data sets (in JSON)
├── devo_notebooks # Notebooks for development
├── models         # Stored trained models
├── output         # Results in CSV format from model predictions
├── scripts        # Main scripts for data processing and model training
└── tests          # Unit tests for scripts
```

## Quick Links

For fellow DSA4262 peer reviewers:

1. Follow the [installation instructions](#installation)
2. Follow the steps for [generating predictions from our pre-trained model](#using-our-pre-trained-model)

## Main Flow

The main workflow consists of data processing, model training, and prediction generation. The component diagram below provides a high-level view:

*(flow diagram)*
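For orientation, the three stages map onto the repository's three main scripts; full usage is covered below:

```sh
# data processing (parse_testset.py plays the same role for test sets)
python3 scripts/parse_json.py <training_set_path> <output_file_name>
# model training
python3 scripts/catboost_training.py <parsed_training_set_path> <output_file_name>
# generating predictions
python3 scripts/catboost_predictions.py <parsed_test_set_path> <model_path> <output_name>
```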

## Installation

1. Clone the repo

   ```sh
   git clone https://github.com/hoofangyu/dsa4262.git
   ```

2. Move into the `dsa4262` directory

   ```sh
   cd dsa4262
   ```

3. Install the required packages

   ```sh
   sudo apt install python3-pip
   python3 -m pip install -r requirements.txt
   ```

4. Grant execute permissions to the `run` script

   ```sh
   chmod 500 run
   ```
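As an optional sanity check, you can confirm that the core dependencies import cleanly; this assumes catboost and pandas are among the packages pinned in `requirements.txt`:

```sh
# optional sanity check; assumes catboost and pandas are in requirements.txt
python3 -c "import catboost, pandas; print('dependencies OK')"
```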

## Usage

### Using our Pre-Trained Model

With our pre-trained model, the workflow consists of only the data processing and prediction generation steps. Here is the high-level view:

*(flow diagram)*

1. Use our sample test set in the `/data` folder, OR move/download a test set directly into the `/data` folder
2. Parse the test set with `parse_testset.py`

   ```sh
   python3 scripts/parse_testset.py <dataset_path> <output_file_name>
   ```

3. Run prediction with `catboost_predictions.py`

   ```sh
   python3 scripts/catboost_predictions.py <parsed_test_set_path> <model_path> <output_name> [--parquet]
   ```

The `--parquet` flag is optional. Include it if you wish to save the output file in Parquet format instead of the default CSV.
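For example, to write the predictions as Parquet (the output name here is illustrative):

```sh
# same invocation as above, with Parquet output enabled
python3 scripts/catboost_predictions.py data/eval.parquet models/final_catboost_model.cbm my_results --parquet
```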

> [!NOTE]
> **Using the `run` shell script**
>
> Alternatively, you may use our `run` script for convenience:
>
> 1. Move or download the test set directly into the `/data` folder
> 2. Parse the test set and run predictions
>
>    ```sh
>    ./run <test_set_path> <parse_test_set_name> <trained_model_path> <predictions_output_name> [is_parquet]
>    ```
>
> The `[is_parquet]` option (`true`/`false`) is optional. Pass `true` if you wish to save the output file in Parquet format instead of the default CSV.
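As a sketch, a `run` invocation with Parquet output enabled would look like this (the output name is illustrative):

```sh
# parse the sample test set, predict, and save the results as Parquet
./run data/dataset2.json.gz eval models/final_catboost_model.cbm dataset2_results true
```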


#### Example Usage (using our sample test set)

**On AWS Ubuntu Instance (REFER TO THIS SECTION FOR STUDENT EVALUATION)**

> [!TIP]
> We recommend using the largest instance, `m6i.4xlarge`, if possible, for faster computation speeds.

> [!TIP]
> To obtain the AWS `ip_address`, you can run this command within the console of the AWS Ubuntu instance:
>
> ```sh
> curl http://169.254.169.254/latest/meta-data/public-ipv4
> ```
1. This example uses the sample test set `data/dataset2.json.gz`
2. Run the `run` shell script (on AWS)

   ```sh
   ./run data/dataset2.json.gz eval models/final_catboost_model.cbm dataset2_final_catboost_model_results
   ```

OR

1. Run the individual Python scripts (on AWS)

   1. Parse the test set

      ```sh
      python3 scripts/parse_testset.py data/dataset2.json.gz eval
      ```

   2. Run prediction

      ```sh
      python3 scripts/catboost_predictions.py data/eval.parquet models/final_catboost_model.cbm dataset2_final_catboost_model_results
      ```

2. Download the predictions file from AWS to your local machine

   ```sh
   # scp -i <local_pem_file_path> <host_name@ip_address:path_to_predictions_file_in_dsa4262_folder_on_aws> <local_destination_path>
   scp -i parkitect.pem ubuntu@11.111.111.111:dsa4262/output/dataset2_final_catboost_model_results.csv .
   ```
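To take a quick look at the downloaded predictions, a pandas one-liner works; this assumes pandas is installed on your local machine:

```sh
# print the first few rows of the predictions CSV
python3 -c "import pandas as pd; print(pd.read_csv('dataset2_final_catboost_model_results.csv').head())"
```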

#### Example Usage (for local file)

**On AWS Ubuntu Instance**

1. Upload the local test set to the `/data` folder. On your local console, run the following:

   ```sh
   # scp -i <local_pem_file_path> <local_testset_path> <host_name@ip_address:path_to_data_folder_in_dsa4262_folder_on_aws>
   scp -i parkitect.pem data/dataset2.json.gz ubuntu@11.111.111.111:dsa4262/data
   ```

2. Run the `run` shell script (on AWS)

   ```sh
   ./run data/dataset2.json.gz eval models/final_catboost_model.cbm dataset2_final_catboost_model_results
   ```

OR

1. Run the individual Python scripts (on AWS)

   1. Parse the test set

      ```sh
      python3 scripts/parse_testset.py data/dataset2.json.gz eval
      ```

   2. Run prediction

      ```sh
      python3 scripts/catboost_predictions.py data/eval.parquet models/final_catboost_model.cbm dataset2_final_catboost_model_results
      ```

2. Download the predictions file from AWS to your local machine

   ```sh
   # scp -i <local_pem_file_path> <host_name@ip_address:path_to_predictions_file_in_dsa4262_folder_on_aws> <local_destination_path>
   scp -i parkitect.pem ubuntu@11.111.111.111:dsa4262/output/dataset2_final_catboost_model_results.csv .
   ```

**On Local**

1. Move the local test set to the `/data` folder.
2. Run the `run` shell script

   ```sh
   ./run data/dataset1.json.gz eval models/final_catboost_model.cbm dataset1_final_catboost_model_results true
   ```

OR

1. Run the individual Python scripts

   1. Parse the test set

      ```sh
      python3 scripts/parse_testset.py data/dataset1.json.gz eval
      ```

   2. Run prediction

      ```sh
      python3 scripts/catboost_predictions.py data/eval.parquet models/final_catboost_model.cbm dataset1_final_catboost_model_results
      ```

#### Example Usage (for public online file)

**On AWS Ubuntu Instance**

1. Download the public test set to the `/data` folder

   ```sh
   aws s3 cp --no-sign-request s3://sg-nex-data/data/processed_data/m6Anet/SGNex_A549_directRNA_replicate5_run1/data.json data/
   ```

2. Run the `run` shell script

   ```sh
   ./run data/data.json eval models/final_catboost_model.cbm SGNex_A549_directRNA_replicate5_run1_final_catboost_model_results
   ```

OR

1. Run the individual Python scripts

   1. Parse the test set

      ```sh
      python3 scripts/parse_testset.py data/data.json eval
      ```

   2. Run prediction

      ```sh
      python3 scripts/catboost_predictions.py data/eval.parquet models/final_catboost_model.cbm SGNex_A549_directRNA_replicate5_run1_final_catboost_model_results
      ```

2. Download the predictions file from AWS to your local machine

   ```sh
   # scp -i <local_pem_file_path> <host_name@ip_address:path_to_predictions_file_in_dsa4262_folder_on_aws> <local_destination_path>
   scp -i parkitect.pem ubuntu@11.111.111.111:dsa4262/output/SGNex_A549_directRNA_replicate5_run1_final_catboost_model_results.csv .
   ```

### Using our scripts to train your own model

If you generate your own model with our scripts, the workflow follows the main flow: data processing, then model training, and finally prediction generation. Here is the high-level view:

*(flow diagram)*

1. Move or download the training sets directly into the `/data` folder
2. Move or download the labels directly into the `/data` folder
3. Parse the training set with its labels using `parse_json.py`

   ```sh
   python3 scripts/parse_json.py <training_set_path> <output_file_name>
   ```

4. Train a model on the parsed training set from Step 3 with `catboost_training.py`

   ```sh
   python3 scripts/catboost_training.py <parsed_training_set_path> <output_file_name>
   ```

5. Parse the test set with `parse_testset.py`

   ```sh
   python3 scripts/parse_testset.py <test_set_path> <output_file_name>
   ```

6. Run prediction using the parsed test set from Step 5 and the trained model from Step 4 with `catboost_predictions.py`

   ```sh
   python3 scripts/catboost_predictions.py <parsed_test_set_path> <model_path> <output_name> [--parquet]
   ```

The `--parquet` flag is optional. Include it if you wish to save the output file in Parquet format instead of the default CSV.



#### Example Usage (for local file)

**On AWS Ubuntu Instance**

1. Move the local training and test sets to the `/data` folder. On your local console, run the following:

   ```sh
   # scp -i <local_pem_file_path> <local_dataset_path> <host_name@ip_address:path_to_data_folder_in_dsa4262_folder_on_aws>
   scp -i parkitect.pem data/dataset1.json.gz ubuntu@11.111.111.111:dsa4262/data
   scp -i parkitect.pem data/dataset0.json.gz ubuntu@11.111.111.111:dsa4262/data
   ```

2. Parse the training set

   ```sh
   python3 scripts/parse_json.py data/dataset0.json.gz training
   ```

3. Train the model

   ```sh
   python3 scripts/catboost_training.py data/training.parquet cbmodel
   ```

4. Parse the test set

   ```sh
   python3 scripts/parse_testset.py data/dataset1.json.gz eval
   ```

5. Run prediction

   ```sh
   python3 scripts/catboost_predictions.py data/eval.parquet models/cb_model.cbm dataset1_final_cb_model_results
   ```

6. Copy the predictions file from AWS to your local machine

   ```sh
   # scp -i <local_pem_file_path> <host_name@ip_address:path_to_predictions_file_in_dsa4262_folder_on_aws> <local_destination_path>
   scp -i parkitect.pem ubuntu@11.111.111.111:dsa4262/output/dataset1_final_cb_model_results.csv .
   ```

**On Local**

1. Move the local training and test sets to the `/data` folder.
2. Parse the training set

   ```sh
   python3 scripts/parse_json.py data/dataset0.json.gz training
   ```

3. Train the model

   ```sh
   python3 scripts/catboost_training.py data/training.parquet cbmodel
   ```

4. Parse the test set

   ```sh
   python3 scripts/parse_testset.py data/dataset1.json.gz eval
   ```

5. Run prediction

   ```sh
   python3 scripts/catboost_predictions.py data/eval.parquet models/cb_model.cbm dataset1_final_cb_model_results
   ```
