The goal of this tutorial is to discuss and practice managing end-to-end ML system development. End-to-end ML system development includes many phases, such as data collection, data pre-processing, and ML model serving, not just running experiments for (training of) ML models.
Managing a machine learning project is a complicated process that involves substantial manual effort. How can you effectively manage the end-to-end process, from having a dataset to serving prediction models?
We suggest you follow all the steps, especially studying and practicing the key questions posed. However, you can use different datasets and different models to practice the tutorial. If you want to train a model using federated ML frameworks as an advanced topic, you can check our Introduction to FL and use suitable tools mentioned in the slides, such as Flower. For experiment data, you can check the Fraud Detection experiment in our UCC paper.
You start building an ML model based on existing data and requirements. Key steps in preparing data:
- Investigate the data: how can an ML model be applied, and which one?
- What are the requirements for your ML model/service?
- Why do we need to manage metadata? Which metadata of the data/model must be managed?
- When dealing with dynamic changes in data quality, how can we improve model serving?
We will carry out a case study of ML development for predictive maintenance in BTS (Base Transceiver Stations). The example raw data can be accessed from: tutorials/MLProjectManagement/BTS_Example/raw_data
Note: the raw data is currently not uploaded because we are still figuring out a way to avoid uploading a whole day of data; contact us if you need the raw data at the moment.
- Do you understand the data?
- Is the current form of the data good enough for us to start? What does "good" mean here?
- Which data fields are important and can be used for ML?
- Who could actually help you understand the data and its business context?
- If you make some decisions, can you document them and use the documentation to explain your decisions to relevant stakeholders later on?
- Check the BTS data and discuss the data fields.
- Requirements: for example, read our initial work on requirements for explainability in end-to-end ML development and work on some selected questions.
- Do we need to transform the data?
- Should we enrich the data with additional data? Where would the additional data come from?
- Which features should we select for ML models, and why? (A quick inspection, as in the sketch below, can help.)
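Before answering these questions, it helps to inspect the raw data directly. Below is a minimal sketch, assuming the raw CSV files contain the fields used later in this tutorial (station_id, parameter_id, reading_time, value); the file name is illustrative:

import pandas as pd

# Load one of the raw CSV files; "sample.csv" is an illustrative name
df = pd.read_csv("./raw_data/sample.csv")
print(df.head())
print(df.dtypes)
# Value distribution per parameter: counts, ranges, spread
print(df.groupby("parameter_id")["value"].describe())
# Fraction of missing values per column
print(df.isna().mean())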
The BTS data that we have is still in its raw format and is not ready to be used for the prediction model. Thus, we need to preprocess the data: we first convert reading_time, then group the data by station_id and parameter_id. This process can be done by executing the code below, or by running the file group_data.py.
import pandas as pd
import os, fnmatch
# Collect per-file DataFrames for loading data
frames = []
# Read list of files in the raw_data folder
listOfFiles = os.listdir('./raw_data')
# Take only .csv files
pattern = "*.csv"
# Load raw data into one DataFrame
for file_name in listOfFiles:
    if fnmatch.fnmatch(file_name, pattern):
        frames.append(pd.read_csv("./raw_data/{}".format(file_name)))
# DataFrame.append was removed in pandas 2.0; pd.concat works across versions
bts_df = pd.concat(frames, ignore_index=True)
# Convert reading_time to an integer unix timestamp (nanoseconds)
bts_df["unix_timestamp"] = pd.to_datetime(bts_df["reading_time"]).astype('int64')
# Center timestamps on the mean and express them in hours
mean_time = bts_df["unix_timestamp"].mean()
bts_df["norm_time"] = (bts_df["unix_timestamp"] - mean_time) / (3600 * 1000000000)
# Sort data and remove unnecessary features
bts_df = bts_df.sort_values(by=['norm_time'])
bts_df.drop(["reading_time"], axis='columns', inplace=True)
# Group data by station ID and parameter ID
# -> making predictions on individual parameters for each station
bts_df_grouped = bts_df.groupby(["station_id", "parameter_id"])
# Make sure the output folder exists
os.makedirs("./data_grouped", exist_ok=True)
# Process the grouped data
for key, item in bts_df_grouped:
    # Copy so that normalization does not write into a view of bts_df
    sub_data = bts_df_grouped.get_group(key).copy()
    # Normalize values in each group: center on the mean, scale by the maximum
    mean_val = sub_data['value'].mean()
    sub_data['norm_value'] = sub_data['value'] - mean_val
    max_val = sub_data['norm_value'].max()
    sub_data['norm_value'] = sub_data['norm_value'] / max_val
    sub_data.sort_values(by=['norm_time']).to_csv("./data_grouped/{}_{}_.csv".format(key[0], key[1]), index=False)
    print("Finish: {}".format(key))

After grouping the data, we need to do some further pre-processing to turn the data into serial (time-series) data and normalize it. You can use the following code, or open the file Model.ipynb and test the code directly.
- Note: You can simply install Jupyter Notebook:
$ pip install jupyter
# Register kernel with your current python virtual environment
$ python -m ipykernel install --user --name=<your_kernel_name>
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
import mlflow
# Load grouped dataset
# We will train the ML model with dataset from station "1161114004" on the parameter "122"
grouped_dataset = pd.read_csv("./data_grouped/1161114004_122_.csv")
# Cast data features to specific data types
grouped_dataset = grouped_dataset.astype({'id':'float','value':'float', 'station_id':'int', 'parameter_id':'int', 'unix_timestamp':'int', 'norm_time':'float'})
dataset = grouped_dataset.copy()
# Drop error/None data and unused features
dataset = dataset.dropna().drop(['id','station_id','parameter_id','unix_timestamp'], axis=1)
dataset_full = dataset.sort_values(by=['norm_time'])
# Shift the data to create a short time series (6 data points) for each record
serial_data = dataset.drop(['value','norm_time'], axis=1)
serial_data['norm_1'] = serial_data['norm_value'].shift(1)
serial_data['norm_2'] = serial_data['norm_value'].shift(2)
serial_data['norm_3'] = serial_data['norm_value'].shift(3)
serial_data['norm_4'] = serial_data['norm_value'].shift(4)
serial_data['norm_5'] = serial_data['norm_value'].shift(5)
serial_data['norm_6'] = serial_data['norm_value'].shift(6)
train_dataset = serial_data[6:]

You can extract testing data from another grouped data file (here, a different station) to test the model. You can also specify the starting and ending lines of the test data from the command line. The testing data should be pre-processed in the same way as the training data.
# Read test dataset from a different file
# We will test the trained ML model with data from station "1161114002" on the same parameter "122"
test_file_name = "./data_grouped/1161114002_122_.csv"
test_grouped_dataset = pd.read_csv(test_file_name)
test_grouped_dataset = test_grouped_dataset.astype({'id':'float','value':'float', 'station_id':'int', 'parameter_id':'int', 'unix_timestamp':'int', 'norm_time':'float'})
test_dataset = test_grouped_dataset.copy()
test_dataset = test_dataset.dropna().drop(['id','station_id','parameter_id','unix_timestamp'], axis=1)
test_dataset_full = test_dataset.sort_values(by=['norm_time'])
# Choose a small part of the data to test the model
start_line = 0
end_line = 100
test_data = test_dataset_full[start_line:end_line]
# Similar to training dataset -> making a time series input for ML prediction
test_serial_data = test_data.drop(['value','norm_time'], axis=1)
test_serial_data['norm_1'] = test_serial_data['norm_value'].shift(1)
test_serial_data['norm_2'] = test_serial_data['norm_value'].shift(2)
test_serial_data['norm_3'] = test_serial_data['norm_value'].shift(3)
test_serial_data['norm_4'] = test_serial_data['norm_value'].shift(4)
test_serial_data['norm_5'] = test_serial_data['norm_value'].shift(5)
test_serial_data['norm_6'] = test_serial_data['norm_value'].shift(6)
test_dataset = test_serial_data[6:]
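Note that the training code later in this tutorial refers to train_features, train_labels, test_features, and test_labels, which are prepared in the accompanying files. As a rough sketch of one plausible construction (our assumption, not necessarily what model.py does), the lagged columns can be reshaped into the (samples, timesteps, features) arrays that the LSTM expects:

import numpy as np
# Lagged values, ordered oldest to newest, form the input window
lag_cols = ['norm_6', 'norm_5', 'norm_4', 'norm_3', 'norm_2', 'norm_1']
train_features = train_dataset[lag_cols].to_numpy().reshape(-1, 6, 1)
test_features = test_dataset[lag_cols].to_numpy().reshape(-1, 6, 1)
# With return_sequences=True and TimeDistributed(Dense(1)), the model outputs one
# value per timestep, so the target is repeated across the window in this sketch
train_labels = np.repeat(train_dataset[['norm_value']].to_numpy(), 6, axis=1).reshape(-1, 6, 1)
test_labels = np.repeat(test_dataset[['norm_value']].to_numpy(), 6, axis=1).reshape(-1, 6, 1)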
Concern: we examine the data and start using it for ML model development. What if we don't keep track of information about the data? What do you think?
Researchers have discussed the importance of keeping information about the data used for ML. Try to read the following paper:
- Timnit Gebru, Jamie Morgenstern, Briana Vecchione, Jennifer Wortman Vaughan, Hanna Wallach, Hal Daumé III, and Kate Crawford. 2021. Datasheets for datasets. Commun. ACM 64, 12 (December 2021), 86–92.
- Why do we need to keep metadata about data?
- What are the types of metadata? How do we obtain them?
Currently, there are existing tools to store metadata of datasets, e.g., BigQuery and Atlas, and tools to manage and discover data, such as Google Cloud Data Catalog and IBM Knowledge Catalog. However, there is a lack of frameworks for capturing complex relationships between datasets and the different entities (e.g., other datasets, ML models) in an ML solution.
Metadata can be used to manage information about datasets. A metadata object should cover different aspects of a dataset, such as data management (name, version, URL, size, created_time, last_modified_time, provider, description), dataset_dependency, and quality of data (completeness, label_ratio, etc.). The following JSON is an example of the metadata for the dataset used in this example:
[
    {
        "Dataset_id": "bts1",
        "Name": "bts_param",
        "Created_time": "",
        "Last_modified_time": "05/31/2021, 14:07:25",
        "Version": "1.0",
        "URL": "examples/BTS_Data/raw_data",
        "Provider": "BachPhu",
        "Description": "",
        "Size": {
            "value": "194.5",
            "unit": "MB"
        },
        "Data_type": [
            "Time Series"
        ],
        "Dataset_dependency": {
            "bts02": {
                "Type_of_relationship": "output"
            }
        },
        "Quality_of_data": {
            "completeness": {
                "Value": "0.9"
            },
            "label_ratio": {
                "Value": "0.9"
            }
        }
    }
]

In this tutorial:
You should set the goal of finding a suitable way to keep metadata that can be updated and retrieved easily with your ML development tools; one option is sketched below.
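One lightweight approach (a sketch of our own, not a prescribed solution) is to keep the metadata as a JSON file next to the data and attach it to each experiment run as an MLflow artifact, so the dataset description travels with every run that uses the data:

import json
import mlflow

# The file name is illustrative; it holds the JSON metadata shown above
with open("dataset_metadata.json") as f:
    dataset_meta = json.load(f)

with mlflow.start_run():
    # Store the full metadata document with the run ...
    mlflow.log_artifact("dataset_metadata.json")
    # ... and tag the run so it can be searched by dataset later
    mlflow.set_tag("dataset_id", dataset_meta[0]["Dataset_id"])
    mlflow.set_tag("dataset_version", dataset_meta[0]["Version"])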
- Which ML algorithms should we choose in order to create suitable ML models? Why?
- For explainability of your choice (to other stakeholders), how would you document the explanation of the choice?
The next steps are model training and model experiments. We will need to record performance metrics, machine information, etc., and associate them with the data used (and its metadata) so that all information is linked for an end-to-end ML experiment. We assume that you follow existing methods to select ML algorithms to create suitable models. In this practice, we will use an LSTM.
After selecting the model and suitable methods:
- How will you do the training and model experiments?
- How will you record performance metrics, machine information, etc., and associate them with the data used (and its metadata) so that all information is linked for an end-to-end ML experiment?
- How would you manage thousands of experiments?
- How do you evaluate or compare your experiments based on multiple metrics? What would be an appropriate solution?
There are many tools for this; MLflow might be a good one to help you stay on top of what is going on. Using MLflow, we study how to capture the relationships among configurable parameters, machine learning code, input data, output results, and performance metrics.
MLflow introduction and installation: MLflow is a popular Python package for managing the machine learning life cycle. It provides functions such as the following:
- Tracking: track experiments to store parameters and results.
- Packaging: package the project code in reproducible form in order to share or transfer to production.
- Deploying: manage and deploy models from a variety of machine learning libraries.
MLflow allows us to collect experiment data for your machine learning applications. These data are usually useful for further analysis, statistics, prediction, and optimization.
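As a first taste of the tracking API, the following minimal sketch records one parameter and one metric (the names here are illustrative):

import mlflow

with mlflow.start_run():
    mlflow.log_param("learning_rate", 0.005)  # a configuration of the run
    mlflow.log_metric("loss", 0.12)           # a result of the run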
There are also other tools for storing experiments; in this tutorial, we focus on MLflow.
To get ready for the tutorial, please don't forget to install MLflow and scikit-learn first. It is recommended to install Anaconda to simplify package management and deployment. You can download the corresponding version of Anaconda here.
Use the package manager pip to install mlflow.
$ pip install mlflow
For executing some examples of this tutorial, you need to install scikit-learn:
$ pip install scikit-learn
You need a machine learning model to test out MLflow; you can use any model that you have already developed, or develop one if you like.
We use a BTS model whose code can be found in the file model.py or here
import sys
import numpy as np
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
import mlflow
from mlflow.models.signature import infer_signature

# MLflow starts recording training metadata
with mlflow.start_run():
    # Init ML model
    model = keras.Sequential()
    # Check the number-of-layers parameter
    n = sys.argv[1] if len(sys.argv) > 1 else 2
    node_param = []
    # Load the default model configuration if none is given
    file_name = sys.argv[2] if len(sys.argv) > 2 else "conf.txt"
    with open(file_name, 'r') as f:
        content = f.read()
        node_param = content.split(",")
    # Set up model layers based on the loaded configuration
    for i in range(int(n)):
        model.add(layers.LSTM(int(node_param[i]), return_sequences=True))
    model.add(layers.TimeDistributed(layers.Dense(1)))
    # Set up the model optimizer
    model.compile(loss='mean_squared_error', optimizer=tf.keras.optimizers.Adam(0.005))
    # Train the ML model
    fitted_model = model.fit(train_features, train_labels, epochs=2, batch_size=1, verbose=2, validation_data=(test_features, test_labels))
    # Create the model signature
    signature = infer_signature(test_features, model.predict(test_features))
    # MLflow logs the model training parameters
    mlflow.log_param("number of layer", n)
    mlflow.log_param("number of node each layer", node_param)
    fit_history = fitted_model.history
    # MLflow logs the last-epoch value of each training metric (e.g., loss, val_loss)
    for key in fit_history:
        mlflow.log_metric(key, fit_history[key][-1])
    # Create an input example to store in the MLflow model registry
    input_example = np.expand_dims(train_features[0], axis=0)
    # Log the model in the MLflow model registry
    model_name = 'LSTM_model'
    mlflow.keras.log_model(model, model_name, signature=signature, input_example=input_example)

We can log the input example for the model using mlflow.keras.log_model with input_example.
$ python model.py
You can also modify different parameters, such as the loss function, batch_size, epochs, or the test data file, and record them using mlflow.log_param.
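For instance, a small sketch of recording such variations (the parameter names are our own choices; model, train_features, and train_labels come from the code above):

batch_size = 4
epochs = 5
loss_fn = 'mean_absolute_error'
# Record the varied configuration so runs can be compared later
mlflow.log_param("batch_size", batch_size)
mlflow.log_param("epochs", epochs)
mlflow.log_param("loss_function", loss_fn)
model.compile(loss=loss_fn, optimizer=tf.keras.optimizers.Adam(0.005))
model.fit(train_features, train_labels, epochs=epochs, batch_size=batch_size)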
- Note: You can practice training ML models with different configurations to create different model versions. Then identify which metrics should be managed by MLflow and how to compare performance between versions (one programmatic option is sketched below).
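Besides the UI, runs can also be compared programmatically. A minimal sketch using mlflow.search_runs, which returns the runs of an experiment as a pandas DataFrame (column names follow the parameters and metrics logged above):

import mlflow

# Fetch all runs of the current experiment as a DataFrame
runs = mlflow.search_runs()
# Rank configurations by the logged validation loss
best = runs.sort_values("metrics.val_loss").head(5)
print(best[["run_id", "params.number of layer", "metrics.val_loss"]])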
Open a terminal in the current working directory and start the MLflow user interface using the command below:
$ mlflow ui
The results are illustrated in Figure 1, where you can see all the logged parameters and metrics as well as the different runs of your experiment. You can also see that the parameters and metrics are separated in the top row, since they are logged with different MLflow APIs (log_param and log_metric). In this example, we train the model with different sets of layers and nodes per layer, passed through the command line and logged by log_param; the metric collected is the loss of the last epoch.
Now that you have the metadata about the data used, the models, and the model experiments, you can link all of this data together. For example:
{
    "Model_id": "md1",
    "Name": "BTS_LSTM",
    "Status": [
        "ready to use"
    ],
    "Version": "1.0",
    "Description": "",
    "Time_created": "",
    "Artefact": {
        "Url": "",
        "Format": "raw code",
        "License": "000"
    },
    "Quality_of_model": {
        "Accuracy": {
            "operators": [
                "min",
                "max"
            ],
            "unit": "percentage",
            "value": "value",
            "class": [
                "Accuracy"
            ]
        },
        "MSE": {
            "operators": [
                "min",
                "max"
            ],
            "unit": "percentage",
            "value": "value",
            "class": [
                "Accuracy"
            ]
        }
    },
    "Dataset_dependency": {
        "bts02": {
            "type_of_relationship": "input"
        }
    },
    "Model_dependency": {},
    "Package_dependency": {
        "numpy": {
            "version": "1.19.5",
            "source": ""
        },
        "pandas": {
            "version": "1.2.4"
        }
    }
}

This model metadata captures the model in an end-to-end view, explaining the relationships between the data, the model, and the metrics obtained from model experiments; together they are all part of the ML experiments.
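As one possible convention (a sketch, not a fixed scheme), the identifiers from these metadata documents can be attached to the corresponding MLflow run as tags, so experiments, models, and datasets can be cross-referenced later:

import mlflow

with mlflow.start_run():
    # Tags tie this run to the metadata records above
    mlflow.set_tag("model_id", "md1")
    mlflow.set_tag("dataset_id", "bts1")
    # The file names are illustrative; they hold the JSON documents shown above
    mlflow.log_artifact("model_metadata.json")
    mlflow.log_artifact("dataset_metadata.json")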
- How do we package and move code to serving platforms?
- Which service platforms should we use?
- How to deploy and manage ML services?
Given the experimented model, we can package it and perform model serving.
You can check our serving tutorial.
In the following, we explain the basic steps to package models and record them.
Note: this example is based on the model mentioned in the previous section.
Now, you want to share your ML code with other data scientists in a reusable and reproducible form. You can package the code in a virtual environment such as conda so that the code can be executed anywhere. To package the code using MLflow, you have to create MLproject and description files, which define the requirements for executing the code. The files below are an example of packaging the code (following the structure of https://github.com/mlflow/mlflow-example) and executing it in a conda environment.
Create the MLproject file:
name: tutorial
conda_env: conda.yaml
entry_points:
  main:
    parameters:
      configuration_file: {type: string, default: "conf.txt"}
    command: "python model.py --conf {configuration_file}"
Create conda.yaml to define all requirements for the Python program:
name: tutorial
channels:
  - defaults
dependencies:
  - python=3.8.11
  - numpy==1.18.5
  - pandas>=1.0.0
  - scikit-learn
  - matplotlib
  - pip
  - pip:
    - mlflow
    - cffi==1.14.6
    - pandas==1.2.4
    - pillow==8.3.1
    - pyyaml==5.4.1
    - scipy==1.6.2
    - tensorflow==2.4.1

After defining the MLproject and conda.yaml files, you can run your code in another conda environment using the following command from the parent directory:
$ mlflow run BTS_Example
or run it with custom parameters:
$ mlflow run BTS_Example/ -P configuration_file="conf.txt"
Notably, BTS_Example is the directory where your MLproject and conda.yaml files are located. Figure 2 illustrates the result after the program completes: MLflow has created a conda environment for your project with a run ID and executed your code in that environment. With this approach, your code can be executed anywhere MLflow is available.
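The same run can also be launched from Python through the projects API, which can be convenient when orchestrating many packaged experiments (a small sketch):

import mlflow

# Equivalent to: mlflow run BTS_Example/ -P configuration_file="conf.txt"
submitted = mlflow.run("BTS_Example/", parameters={"configuration_file": "conf.txt"})
print("Finished run:", submitted.run_id)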
Given the packaged models, you can select a suitable one and deploy it as a service.
In the following, we show how to do this using MLflow:
MLflow Models define a standard format for packaging machine learning models that can be used in a variety of downstream tools. For example, a model can be served as a service through a REST API.
You can go to the UI to check the saved model:
$ mlflow ui
or with the mlflow server:
$ mlflow server -h 0.0.0.0
Deploy the server using the saved model (replace project_id with the run ID of your experiment):
$ mlflow models serve -m mlruns/0/project_id/artifacts/LSTM_model -p 8888
After the server is deployed successfully, you will see a result similar to Figure 3, where your trained model is deployed and ready to serve predictions.
To test whether the deployment works correctly, you can run predictions on your testing data using the deployed model, as follows. Testing data:
{
    "inputs": [
        [
            [0.074], [-0.003], [-0.08], [-0.157], [-0.235], [-0.312]
        ]
    ]
}

Endpoint:
http://127.0.0.1:8888/invocations
You can also use the following simple client code in client.py to send a request to the API endpoint.
# importing the requests library
import requests
# defining the API endpoint
API_ENDPOINT = "http://0.0.0.0:8888/invocations"
# data to be sent to the API: one window of six normalized values
param = {
    "inputs": [[[0.074], [-0.003], [-0.08], [-0.157], [-0.235], [-0.312]]]
}
# sending a POST request and parsing the response as JSON
response = requests.post(url=API_ENDPOINT, json=param)
result = response.json()
# extracting the first predicted value (newer MLflow versions wrap the
# response in a "predictions" field; use result["predictions"][0][0] there)
print(result[0][0])

- Now the model is deployed and running as a service. You can use monitoring techniques to monitor it. Assume that you want to monitor more complex metrics, such as cost or the performance of your API functions; what are suitable solutions? (See the sketch after this list for a simple starting point.)
- How can you then link the monitoring data of the service back to the model, model experiments, training data, etc.?
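As a simple starting point for both questions, the client can measure request latency and log it back to MLflow, tagging the run so the monitoring data points back to the model metadata; this is only an illustrative sketch:

import time
import requests
import mlflow

API_ENDPOINT = "http://0.0.0.0:8888/invocations"
param = {"inputs": [[[0.074], [-0.003], [-0.08], [-0.157], [-0.235], [-0.312]]]}

with mlflow.start_run():
    # Tag links this monitoring run to the model metadata (Model_id above)
    mlflow.set_tag("model_id", "md1")
    start = time.perf_counter()
    response = requests.post(url=API_ENDPOINT, json=param)
    latency_ms = (time.perf_counter() - start) * 1000
    # Record serving-side performance alongside training metrics
    mlflow.log_metric("prediction_latency_ms", latency_ms)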
We will cover ML service monitoring in other tutorials.
Resources about ML engineering:
- Best Practices for ML Engineering by Martin Zinkevich
- "Operationalizing Machine Learning: An Interview Study"
- Machine Learning eXchange (MLX)
Part of the tutorial related to MLflow (experiments, package, serving) is built upon MLflow official documents. The main references are:
- https://www.mlflow.org/docs/latest/index.html
- https://www.mlflow.org/docs/latest/models.html#models
- https://mlflow.org/docs/latest/tutorials-and-examples/index.html
An old video for running MLflow with the Wine prediction example: