Capstone Project- Machine Learning Engineer with Microsoft Azure

This project was submitted as part of the Machine Learning Engineer with Microsoft Azure Nanodegree. Its aim is to train models in two ways: with Automated Machine Learning (AutoML) and by tuning hyperparameters with HyperDrive. The best-performing model is then deployed as a web service and queried. The following diagrams show the architectures of the HyperDrive run and the AutoML run.

HyperDrive Run Architecture

AutoML Run Architecture

Dataset

Overview

The dataset that has been selected for this project is the Heart Failure Prediction Dataset from Kaggle. This dataset can be used to predict mortality from heart failure.

Task

The task is to predict whether a death event occurs during a patient's follow-up period. The dataset contains 12 features that can be used to predict mortality from heart failure:

age: Age of the patient
anaemia: Decrease of red blood cells or hemoglobin
creatinine_phosphokinase: Level of the CPK enzyme in the blood (mcg/L)
diabetes: If the patient has diabetes
ejection_fraction: Percentage of blood leaving the heart at each contraction
high_blood_pressure: If the patient has hypertension
platelets: Platelets in the blood (kiloplatelets/mL)
serum_creatinine: Level of serum creatinine in the blood (mg/dL)
serum_sodium: Level of serum sodium in the blood (mEq/L)
sex: Woman or man
smoking: If the patient smokes or not
time: Follow-up period (days)

The target column is DEATH_EVENT, which indicates whether the patient died during the follow-up period.

Access

The dataset has been downloaded from Kaggle and uploaded to this GitHub repository. The dataset is then accessed as a TabularDataset using the URL of the raw .csv file.

from azureml.data.dataset_factory import TabularDatasetFactory

path_to_data = "https://raw.githubusercontent.com/neha7598/azure-ml-capstone/main/data/heart_failure_clinical_records_dataset.csv"
data = TabularDatasetFactory.from_delimited_files(path=path_to_data)

Automated ML

The AutoML run was configured with an instance of AutoMLConfig, the class the Azure ML SDK uses to specify an automated machine learning experiment. The following parameters were used for the AutoML run.

Parameter	Value	Description
task	'classification'	Classification is selected because this is a binary classification problem: whether or not a death event occurs
debug_log	'automl_errors.log'	Debug information is written to this file instead of the default automl.log file
training_data	train_data	The dataset containing the data to be used for training
label_column_name	'DEATH_EVENT'	The DEATH_EVENT column holds the value to be predicted
compute_target	compute_cluster	The compute target on which the AutoML experiment runs
experiment_timeout_minutes	30	The maximum time that all iterations combined can take. It is set to 30 because of limited resources
primary_metric	'accuracy'	The metric that AutoML optimizes for model selection. Accuracy was selected as a common choice for binary classification
enable_early_stopping	True	Early stopping terminates the run if the score is not improving in the short term, so AutoML spends less time on unpromising iterations
featurization	'auto'	Featurization is set to auto so that the featurization step is done automatically
n_cross_validations	4	Four-fold cross-validation: there are 4 trainings, and each one holds out a different 1/4 of the data for validation
verbosity	logging.INFO	The verbosity level for writing to the log file

import logging

from azureml.train.automl import AutoMLConfig

# Settings are defined first so they can be unpacked into AutoMLConfig below
automl_settings = {
    "enable_early_stopping": True,
    "experiment_timeout_minutes": 30,
    "featurization": 'auto',
    "primary_metric": 'accuracy',
    "verbosity": logging.INFO
}

automl_config = AutoMLConfig(
    task='classification',
    training_data=train_data,
    label_column_name='DEATH_EVENT',
    n_cross_validations=4,
    compute_target=compute_cluster,
    **automl_settings
)
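To make the n_cross_validations=4 setting concrete, here is a minimal sketch of how four folds partition a dataset so that each row is used for validation exactly once. The helper name k_fold_indices and the row count of 10 are illustrative assumptions; AutoML performs its own splitting internally, and this is not its implementation.

```python
def k_fold_indices(n_rows, n_folds):
    """Yield (train_idx, valid_idx) pairs; each row is validated exactly once."""
    indices = list(range(n_rows))
    # Spread any remainder over the first folds so sizes differ by at most one
    base, extra = divmod(n_rows, n_folds)
    start = 0
    for fold in range(n_folds):
        size = base + (1 if fold < extra else 0)
        valid = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, valid
        start += size


folds = list(k_fold_indices(n_rows=10, n_folds=4))  # 10 rows is a made-up size
for train_idx, valid_idx in folds:
    print(len(train_idx), "train rows, validation rows:", valid_idx)
```

Each of the four trainings fits on the rows outside its validation fold, and the reported score is an aggregate over the folds.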

Results

AutoML tried several algorithms to find the one that performs best for this use case, including LogisticRegression, SVM, Random Forest, XGBoostClassifier, and VotingEnsemble, in combination with preprocessing steps such as MinMaxScaler and MaxAbsScaler. The best-performing model was a VotingEnsemble with an accuracy of 0.88701. AutoML automatically selects the algorithm and its associated hyperparameters, the sampling policy, and the early stopping policy. It also identifies algorithms that are blocked or will not work in a particular case (TensorFlowLinearClassifier and TensorFlowDNN in this case).

The following parameters were generated for the VotingEnsemble Model:

Parameter	Value
random_state	0
reg_alpha	2.0833333333333335
reg_lambda	1.7708333333333335
scale_pos_weight	1
seed	None
silent	None
subsample	0.9
tree_method	'hist'
verbose	-10
verbosity	0

The generated weights were: 0.2857142857142857, 0.14285714285714285, 0.14285714285714285, 0.2857142857142857, 0.14285714285714285
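These five weights sum to 1, which is consistent with a weighted soft-voting scheme in which each member's predicted class probabilities are averaged using its weight. The sketch below illustrates that idea; the per-member probabilities are invented for illustration, and the function is not AutoML's internal implementation.

```python
import math

# Ensemble weights reported for the VotingEnsemble model
weights = [0.2857142857142857, 0.14285714285714285, 0.14285714285714285,
           0.2857142857142857, 0.14285714285714285]

# Hypothetical P(DEATH_EVENT = 1) from each of the five ensemble members
member_probs = [0.80, 0.40, 0.55, 0.70, 0.30]


def soft_vote(probs, weights):
    """Weighted average of the members' positive-class probabilities."""
    return sum(p * w for p, w in zip(probs, weights)) / sum(weights)


ensemble_prob = soft_vote(member_probs, weights)
prediction = int(ensemble_prob >= 0.5)  # 0.5 cutoff is an assumption
print(round(ensemble_prob, 4), prediction)
```

Members with weight 2/7 count twice as much toward the final probability as members with weight 1/7.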

The details of the AutoML run can be monitored using the RunDetails widget.

Once the run finished, its summary can be seen below:

The best model details are shown below:

Hyperparameter Tuning

The model used for hyperparameter tuning with HyperDrive is a Logistic Regression model trained by a custom script, train.py. The dataset is fetched from a URL as a TabularDataset. The hyperparameters tuned for the scikit-learn model are the regularization parameter C (in scikit-learn, the inverse of regularization strength) and the maximum number of iterations (max_iter).

"--C": uniform(0.001, 100),
"--max_iter": choice(50, 75, 100, 125, 150)

Hyperparameter tuning with HyperDrive requires several steps: defining the parameter search space, defining a sampling method, choosing a primary metric to optimize, and selecting an early stopping policy.

The parameter sampling method used for this project is random sampling. It draws hyperparameter values at random from the search space, so the entire space does not need to be searched. Random sampling saves time compared with grid sampling, which tries every combination exhaustively, and with Bayesian sampling, which chooses new samples based on how earlier ones performed; both are recommended only if you have the budget to explore the search space thoroughly.
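As a rough illustration of what random sampling does with the search space above, the following sketch draws configurations using only the standard library. The helper sample_config and the fixed seed are assumptions for the example; in the project itself the sampling is carried out by HyperDrive.

```python
import random

MAX_ITER_CHOICES = [50, 75, 100, 125, 150]


def sample_config(rng):
    """Draw one configuration from the search space used in this project."""
    return {
        "--C": rng.uniform(0.001, 100),              # continuous range
        "--max_iter": rng.choice(MAX_ITER_CHOICES),  # discrete set
    }


rng = random.Random(42)  # seeded only to make the sketch reproducible
configs = [sample_config(rng) for _ in range(5)]
for config in configs:
    print(config)
```

Each configuration corresponds to one child run, which would receive the sampled values as command-line arguments to train.py.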

The early stopping policy used in this project is a Bandit policy, which is based on a slack factor (0.1 in this case) and an evaluation interval (1 in this case). At each evaluation interval, the policy terminates runs whose primary metric is not within the specified slack factor of the best-performing run. This saves time and resources, because runs that are unlikely to lead to good results are stopped early.
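The termination rule can be sketched as follows for a metric that is being maximized: with slack factor s, a run is stopped when its metric falls below best / (1 + s). The function name and the sample accuracies are illustrative assumptions, and the sketch ignores details such as delayed evaluation.

```python
def should_terminate(run_metric, best_metric, slack_factor=0.1):
    """Return True if a run falls outside the slack allowed by a Bandit-style rule.

    Assumes the primary metric is maximized (for example, accuracy).
    """
    threshold = best_metric / (1 + slack_factor)
    return run_metric < threshold


best_so_far = 0.85  # hypothetical best accuracy at this evaluation interval
for accuracy in (0.84, 0.78, 0.70):
    decision = "terminate" if should_terminate(accuracy, best_so_far) else "keep"
    print(accuracy, decision)
```

With a slack factor of 0.1, a run survives as long as it reaches roughly 91% of the best metric seen so far.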

Results

The best HyperDrive run achieved an accuracy of 86.67%. The hyperparameters selected for the best HyperDrive run are listed below:

Parameter	Value
Regularization Strength (C)	85.35037
Max iterations (max_iter)	75

The details of the HyperDrive run are monitored using the Run Details widget.

The best model obtained from the HyperDrive experiment achieved an accuracy of 86.67%. The values of the hyperparameters selected for this model are shown below:

Model Deployment

Since the model trained using AutoML achieved a higher accuracy (88.701%) than the best HyperDrive model (86.67%), it was chosen for deployment.

Steps for Model Deployment

Register the Model

description = 'AutoML Model trained on heart failure data to predict if death event occurs or not'
tags = None
model = remote_run.register_model(model_name = model_name, description = description, tags = tags)

Define an Entry Script

The entry script receives data submitted to a deployed web service and passes it to the model. It then takes the response returned by the model and returns it to the client. For an AutoML model, this script can be downloaded from the files generated by the AutoML run, as the following snippet shows.

script_file_name = 'inference/score.py'
# Download the scoring script generated by AutoML to the local inference folder
best_run.download_file('outputs/scoring_file_v_1_0_0.py', script_file_name)
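For orientation, an entry script generally exposes two functions: init(), called once when the service starts in order to load the model, and run(), called for each request. The sketch below shows that shape with a stand-in model so that it runs without Azure. DummyModel and its threshold rule are hypothetical, and the generated scoring file differs in its details (for example, it loads the registered model from disk).

```python
import json


class DummyModel:
    """Hypothetical stand-in for the registered model."""

    def predict(self, rows):
        # Toy rule for illustration only: flag a low ejection fraction
        return [1 if row["ejection_fraction"] < 30 else 0 for row in rows]


model = None


def init():
    """Called once at service start-up; a real script would load the model file."""
    global model
    model = DummyModel()


def run(raw_data):
    """Called per request: parse the JSON body, predict, return a serializable result."""
    try:
        rows = json.loads(raw_data)["data"]
        return {"result": model.predict(rows)}
    except Exception as exc:  # report errors to the client instead of crashing
        return {"error": str(exc)}


init()
payload = json.dumps({"data": [{"ejection_fraction": 22}, {"ejection_fraction": 32}]})
response = run(payload)
print(response)
```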

Define an Inference Configuration

An inference configuration describes how to set up the web service containing the model. It is used later, when the model is deployed.

inference_config = InferenceConfig(entry_script=script_file_name)

Define a Deployment Configuration

aciconfig = AciWebservice.deploy_configuration(cpu_cores = 1, 
                                               memory_gb = 1, 
                                               tags = {'area': "hfData", 'type': "automl_classification"}, 
                                               description = 'Heart Failure Prediction')

Deploy the Model

aci_service = Model.deploy(ws, aci_service_name, [model], inference_config, aciconfig)

Once the model is deployed, its endpoint can be accessed from the Endpoints section under the Assets tab.

The deployment state of the model is shown as Healthy, which indicates that the service is running and the endpoint is available.

Once the model had been deployed, requests were sent to it. Sending a request requires the scoring URI and, if authentication is enabled, the primary key. A POST request is created, and the format of the data to send can be inferred from the Swagger documentation:

The following code interacts with the deployed model by sending it the two data points specified below, which are also saved to the data.json file.

import json

import requests

# URL for the web service, should be similar to:
# 'http://8530a665-66f3-49c8-a953-b82a2d312917.eastus.azurecontainer.io/score'
scoring_uri = aci_service.scoring_uri
# If the service is authenticated, set the key or token

# Two sets of data to score, so we get two results back
data = {"data":
        [
          {
            "age": 70.0,
            "anaemia": 1,
            "creatinine_phosphokinase": 4020,
            "diabetes": 1,
            "ejection_fraction": 32,
            "high_blood_pressure": 1,
            "platelets": 234558.23,
            "serum_creatinine": 1.4,
            "serum_sodium": 125,
            "sex": 0,
            "smoking": 1,
            "time": 12
          },
          {
            "age": 75.0,
            "anaemia": 0,
            "creatinine_phosphokinase": 4221,
            "diabetes": 0,
            "ejection_fraction": 22,
            "high_blood_pressure": 0,
            "platelets": 404567.23,
            "serum_creatinine": 1.1,
            "serum_sodium": 115,
            "sex": 1,
            "smoking": 0,
            "time": 7
          },
      ]
    }
# Convert to JSON string
input_data = json.dumps(data)
with open("data.json", "w") as _f:
    _f.write(input_data)

# Set the content type
headers = {'Content-Type': 'application/json'}
# If authentication is enabled, set the authorization header

# Make the request and display the response
resp = requests.post(scoring_uri, input_data, headers=headers)
print(resp.json())
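The request code leaves the authentication steps as comments. If key-based authentication is enabled on the service, the key is typically sent as a Bearer token in the Authorization header. A minimal sketch, in which build_headers is a hypothetical helper and the key value is a placeholder:

```python
def build_headers(key=None):
    """Build request headers, adding a Bearer token only when a key is given."""
    headers = {"Content-Type": "application/json"}
    if key:
        headers["Authorization"] = f"Bearer {key}"
    return headers


open_headers = build_headers()
auth_headers = build_headers("<primary-key>")  # placeholder, not a real key
print(open_headers)
print(auth_headers)
```

The resulting dictionary can be passed as the headers argument of requests.post in place of the one built above.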

The result obtained from the deployed service is:

If Application Insights is enabled, the requests sent to the model can be monitored through the Application Insights URL, along with failed requests, the time taken per request, and the availability of the deployed service.

Future Improvements

Areas of improvement for future HyperDrive experiments include trying different sampling methods and early stopping policies and increasing the total number of runs. A different sampling method such as grid sampling (as opposed to the random sampling used here) searches the space more exhaustively, which can potentially give a better result; because grid sampling works over discrete choices, the continuous range for C would need to be replaced with a set of candidate values. Also, instead of Logistic Regression, other algorithms such as Random Forest and XGBoost can be explored.

For AutoML, future experiments can use an experiment timeout longer than 30 minutes, which allows a more exhaustive search and potentially better results. A different primary metric such as "AUC_weighted", which is more suitable for datasets with class imbalance, can also be selected.
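To see why accuracy can be misleading when classes are imbalanced, consider a baseline that always predicts the majority class. The labels below are made up for illustration and are not taken from the heart failure dataset.

```python
from collections import Counter

# Hypothetical labels: 90 negatives and 10 positives
y_true = [0] * 90 + [1] * 10

majority_class = Counter(y_true).most_common(1)[0][0]
y_pred = [majority_class] * len(y_true)

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
# Recall on the positive class: the share of actual positives that were found
recall = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred)) / y_true.count(1)

print(f"accuracy={accuracy:.2f}, positive recall={recall:.2f}")
```

The baseline scores high on accuracy while finding none of the positive cases, which is the kind of result a ranking-based metric such as AUC is less prone to reward.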
