Benchmarking Framework for Machine Learning Algorithms on Embedded Devices

Overview

This project benchmarks the STM32F103 microcontroller's performance when running machine learning models for keyword spotting, image classification, anomaly detection, and emotion recognition. We convert the models into optimized formats and analyze inference speed, memory usage, and power consumption to assess their feasibility on embedded systems. We used Keil uVision5 for firmware development and provide a Python-based simulation environment for validating models before deployment. This work helps gauge the practicality of machine learning on low-power embedded devices.

Tools

Hardware

  • STM32 Development Board: An STM32F103C8 board
  • ST-Link programmer for flashing the firmware

Software

  • Development Tools:
    • Keil uVision5 (ARM uVision 5, version 5.x)
  • Libraries:
    • TensorFlow Lite for Microcontrollers: A lightweight version of TensorFlow designed for microcontroller environments.
  • Other Tools:
    • Python 3.x: For scripting and model preparation.
    • Git: For version control.

Implementation Details

Deployment with Keil uVision5

1. Model Preparation and Conversion

  • Convert trained model to TensorFlow Lite format using ModelConverter.py.
  • Validate the TensorFlow Lite model with ValidateTFModel.py.
  • Convert the .tflite model into a C header file using the command:
    xxd -i speech_commands_model_float32.tflite > model_data.h
    This embeds the model data directly into firmware.
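The same conversion can be scripted in Python when `xxd` is unavailable (e.g. on Windows). This is a minimal sketch, not part of the repository's scripts; the filename and array name mirror the `xxd` command above:

```python
# Sketch: produce a C header equivalent to `xxd -i model.tflite`.
def bytes_to_c_header(data: bytes, array_name: str) -> str:
    """Render raw bytes as a C unsigned-char array plus a length constant."""
    lines = [f"unsigned char {array_name}[] = {{"]
    for i in range(0, len(data), 12):
        chunk = ", ".join(f"0x{b:02x}" for b in data[i:i + 12])
        lines.append(f"  {chunk},")
    lines.append("};")
    lines.append(f"unsigned int {array_name}_len = {len(data)};")
    return "\n".join(lines)

# Usage (illustrative):
#   with open("speech_commands_model_float32.tflite", "rb") as f:
#       header = bytes_to_c_header(f.read(), "speech_commands_model_float32_tflite")
#   with open("model_data.h", "w") as f:
#       f.write(header)
```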

2. Firmware Development in Keil uVision5

  • Open the Keil project file (e.g., RUN.uvprojx) located in the Code/STM32 folder.
  • Configure the target device as an STM32F103 microcontroller, ensuring clock and memory settings match the board.
  • In the main source file (e.g., Runner.c), implement code to:
    • Capture audio data.
    • Run inference using the embedded model from model_data.h.
    • Measure and print inference time and predicted labels to a serial terminal.

Keil uVision5 Board Menu

Keil uVision5 Manage Run-Time Environment Menu

Run-Time Environment Config

Project Structure

Cortex-M Target Driver Setup, Debug Menu

Cortex-M Target Driver Setup, Flash Download Menu

3. Building and Flashing the Firmware

  • Build the Project: In Keil uVision5, navigate to Project > Open Project... to load your project, then click the Build button to compile the firmware.
  • Flash the Firmware: Connect the STM32F103 board via ST-Link, and use the Download option to flash the firmware. After flashing, reset or power cycle the board to start the benchmark application.

4. Running the Benchmark

  • Open a serial terminal (e.g., PuTTY, Tera Term) to monitor the output from the STM32F103 board.
  • The device prints inference times and predicted labels, allowing you to analyze performance metrics such as inference latency, memory usage, and power consumption.
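The serial output can also be captured and parsed programmatically instead of read by hand. The line format below (`label=... time_ms=...`) is an assumption for illustration; adjust the regex to match what Runner.c actually prints:

```python
import re

# Assumed firmware output format, e.g. "label=yes time_ms=37.2".
LINE_RE = re.compile(r"label=(?P<label>\S+)\s+time_ms=(?P<ms>[0-9.]+)")

def parse_benchmark_lines(lines):
    """Collect (label, inference time in ms) pairs from serial-terminal output."""
    results = []
    for line in lines:
        m = LINE_RE.search(line)
        if m:
            results.append((m.group("label"), float(m.group("ms"))))
    return results
```

Feeding this a log saved from PuTTY or Tera Term gives a list ready for averaging or plotting.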

Simulation Environment for Python Scripts

  1. Install Python and Dependencies:

    • Ensure that Python 3.x is installed on your system.
    • Open a command prompt in the repository’s root directory.
    • Install required Python packages using:
      pip install -r Code/requirements.txt
  2. Run Simulation Scripts:

    • To validate the model conversion or simulate inference on your computer, navigate to the appropriate folder (e.g., Code).
    • Execute simulation scripts such as:
      python ValidateTFModel.py
    • These scripts will run the TensorFlow Lite model in a simulated environment and output performance metrics and inference results.
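As a rough sketch of what such a validation script does (the repository's ValidateTFModel.py may differ in detail), the TFLite interpreter is loaded, invoked repeatedly, and the per-call latency is summarized. The TensorFlow import is deferred so the latency helper works standalone:

```python
import statistics
import time

def summarize_latency(times_ms):
    """Basic latency statistics over repeated inference runs."""
    ordered = sorted(times_ms)
    return {"mean_ms": statistics.mean(times_ms),
            "p95_ms": ordered[int(0.95 * (len(ordered) - 1))],
            "max_ms": max(times_ms)}

def run_tflite_benchmark(model_path, sample, n_iters=50):
    # Assumes TensorFlow is installed; `sample` must match the model's
    # input shape and dtype.
    import tensorflow as tf
    interpreter = tf.lite.Interpreter(model_path=model_path)
    interpreter.allocate_tensors()
    inp = interpreter.get_input_details()[0]
    out = interpreter.get_output_details()[0]
    times = []
    for _ in range(n_iters):
        interpreter.set_tensor(inp["index"], sample)
        t0 = time.perf_counter()
        interpreter.invoke()
        times.append((time.perf_counter() - t0) * 1000.0)
    return interpreter.get_tensor(out["index"]), summarize_latency(times)
```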

Scenarios

As shown in the table below, this work assesses four scenarios, each targeting a different ML task: speech recognition, anomaly detection, image classification, and, in the natural language processing domain, emotion detection.

Overall results for each scenario

As the table shows, most of the developed TFLite models fit within the 64 KB memory target; only the LSTM model exceeds it, because of its tokenizer.

Emotion Detection

To classify which emotion a text sentence expresses, we developed two different models: BERT and LSTM.

For the BERT model, we utilized the ParsBERT embedding space and defined a Dense-CNN classifier on top of it, trained on our novel dataset. The training process and the resulting accuracy and loss values are shown below:

Accuracy over each iteration of training

The value of loss for each iteration

With the aid of the TFLite converter, we reduced the model to one-fourth of its initial size, i.e., 162 MB. This still far exceeds the memory limit of our embedded system, so we developed an LSTM architecture in its place.

LSTM Model to Detect Emotion

For this task, we developed two distinct models: a regular one using a simple tokenizer with a dictionary capped at 2000 words, and an enhanced one using a dynamic learning rate and sparse_categorical_crossentropy loss. These models reached 82% and 89% accuracy, respectively, which is a promising result for 640 KB and 161 KB models.
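A minimal sketch of this setup follows, assuming TensorFlow/Keras; the layer sizes are illustrative, not the exact trained architecture, and the vocabulary cap matches the 2000-word dictionary described above:

```python
from collections import Counter

MAX_VOCAB = 2000  # matches the 2000-word tokenizer dictionary above

def build_vocab(sentences, max_vocab=MAX_VOCAB):
    """Simple tokenizer dictionary: most frequent words, ids from 1 (0 = padding)."""
    counts = Counter(w for s in sentences for w in s.split())
    return {w: i + 1 for i, (w, _) in enumerate(counts.most_common(max_vocab))}

def build_lstm_classifier(vocab_size, num_classes=12):
    # Assumes TensorFlow/Keras is installed; 12 classes matches the dataset below.
    import tensorflow as tf
    model = tf.keras.Sequential([
        tf.keras.layers.Embedding(vocab_size + 1, 64),
        tf.keras.layers.LSTM(64),
        tf.keras.layers.Dense(num_classes, activation="softmax"),
    ])
    model.compile(optimizer=tf.keras.optimizers.Adam(1e-3),
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model
```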

Confusion Matrix of LSTM Architecture

Result for Enhanced Architecture

Furthermore, more detailed information regarding classification is included below:

 precision    recall  f1-score   support

           0       0.88      0.76      0.82       110
           1       0.81      0.84      0.83       102
           2       0.95      0.86      0.90        97
           3       0.90      0.97      0.94       104
           4       0.97      0.96      0.96        95
           5       0.88      0.98      0.93       100
           6       0.99      0.94      0.96       113
           7       0.98      0.92      0.95       104
           8       0.80      0.73      0.76       101
           9       0.99      1.00      1.00       103
          10       0.91      0.92      0.92       100
          11       0.71      0.85      0.77       105

    accuracy                           0.89      1234
   macro avg       0.90      0.89      0.89      1234
weighted avg       0.90      0.89      0.89      1234

Dataset Gathering

One of this work's novel contributions is the creation of an emotion detection dataset. We used GPT to generate sentences for each emotion, producing more than 6000 labeled Farsi sentences across 12 classes, and applied prompt engineering techniques to ensure the generated sentences were valid and unique.

Anomaly Detection

To detect anomalous behavior, we developed and trained two distinct architectures, one using a random forest and another using an FC-AutoEncoder. The random forest classifier achieved a promising 99% accuracy on this task; however, its model size exceeds 16 MB, which is beyond the limit for the STM32 chipset. The result of anomaly detection is shown below:

Result for Anomaly Detection

Using the FC-AutoEncoder, we achieved acceptable results, shown below:

Result for Anomaly Detection Using FC-AutoEncoder

Nevertheless, as the training process shows, the limited model size prevents it from fully capturing the relation between the labels and the dense feature space.

Limited model size restricts learning, causing minimal loss improvement

Its 41 KB size, however, makes it practical to deploy on embedded systems.
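A small fully connected autoencoder of this kind can be sketched as below. This is an illustrative architecture under the assumption of TensorFlow/Keras, not the trained model itself; anomaly decisions are made by thresholding the reconstruction error:

```python
def anomaly_flags(reconstruction_errors, threshold):
    """Flag samples whose reconstruction error exceeds the chosen threshold."""
    return [err > threshold for err in reconstruction_errors]

def build_fc_autoencoder(n_features, bottleneck=8):
    # Assumes TensorFlow/Keras; a compact dense autoencoder whose weight file
    # stays in the tens of kilobytes, in line with the 41 KB figure above.
    import tensorflow as tf
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(32, activation="relu", input_shape=(n_features,)),
        tf.keras.layers.Dense(bottleneck, activation="relu"),
        tf.keras.layers.Dense(32, activation="relu"),
        tf.keras.layers.Dense(n_features, activation="linear"),
    ])
    model.compile(optimizer="adam", loss="mse")
    return model
```

The threshold is typically picked from the error distribution on normal training data, e.g. a high percentile.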

Image Classification

Overview

Image Classification involves training a model to categorize images into predefined classes using deep learning models like Convolutional Neural Networks (CNNs).

Steps

  1. Model Training
  • Prepare the Dataset: Download and preprocess the dataset (e.g., CIFAR-10, ImageNet). Split into training, validation, and test sets.
  • Define the Model: Use CNNs or pre-trained models like ResNet or VGG.
  • Train the Model: Compile and fit the model to the dataset.
  • Evaluate the Model: Test the model using accuracy, precision, and recall metrics.
  2. Model Evaluation

Evaluate the trained model's performance on a test dataset using metrics like accuracy.

  3. Model Inference

Use the trained model to classify new images.
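The steps above can be sketched as follows, assuming TensorFlow/Keras and CIFAR-10-sized inputs (32x32 RGB); the network is a small illustrative CNN, not the exact benchmarked model:

```python
import numpy as np

def preprocess(images):
    """Scale uint8 image pixels [0, 255] to float32 in [0, 1]."""
    return images.astype("float32") / 255.0

def build_cnn(num_classes=10):
    # Assumes TensorFlow/Keras is installed.
    import tensorflow as tf
    model = tf.keras.Sequential([
        tf.keras.layers.Conv2D(16, 3, activation="relu", input_shape=(32, 32, 3)),
        tf.keras.layers.MaxPooling2D(),
        tf.keras.layers.Conv2D(32, 3, activation="relu"),
        tf.keras.layers.MaxPooling2D(),
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dense(num_classes, activation="softmax"),
    ])
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model
```

Training then follows the usual `model.fit(...)` / `model.evaluate(...)` pattern on the preprocessed splits.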

Keyword Spotting

Overview

Keyword Spotting detects specific words or phrases in an audio stream, typically using features like MFCCs and models like CNNs.

Steps

  1. Dataset Preparation
  • Download the Dataset: Use the Google Speech Commands dataset.
  • Preprocess the Data: Convert audio to features (MFCC, LFBE, or raw samples).
  • Split the Data: Divide into training, validation, and test sets.
  2. Model Training
  • Define the Model: Use CNN or RNN architectures.
  • Train the Model: Compile and train the model using the prepared dataset.
  • Save the Model: Save the trained model for later use.
  3. Model Evaluation

Evaluate the model on the test set using accuracy, precision, recall, and AUC.

  4. Quantization and Inference
  • Quantize the Model: Convert the model to TensorFlow Lite format for efficient deployment.
  • Evaluate the Quantized Model: Ensure the quantized model performs well.
  • Run Inference: Use the model for real-time keyword detection.
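The quantization step can be sketched with TensorFlow's converter. Dynamic-range quantization is shown here for simplicity; the exact converter settings used in this project may differ (full-integer quantization would also require a representative dataset):

```python
def size_reduction(original_bytes, quantized_bytes):
    """Quantized model size as a fraction of the original."""
    return quantized_bytes / original_bytes

def quantize_keras_model(model):
    # Assumes TensorFlow is installed; post-training dynamic-range quantization.
    import tensorflow as tf
    converter = tf.lite.TFLiteConverter.from_keras_model(model)
    converter.optimizations = [tf.lite.Optimize.DEFAULT]
    return converter.convert()  # the .tflite flatbuffer as bytes
```

The returned bytes can be written to a `.tflite` file and then embedded as a C header as described earlier.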

Simulation Process

For the simulation process, we used the TFLite model to obtain results. We monitored memory usage, CPU time, execution time, and model outputs, producing results iteratively for each measurement. The results are shown below.

The CPU usage during iterations

The Memory Usage

The measured outputs of the TFLite model

Execution time across iterations
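A stdlib-only sketch of how such per-iteration measurements can be collected is shown below; the repository's scripts may use different tooling, and `tracemalloc` tracks only Python-level allocations:

```python
import time
import tracemalloc

def measure(fn, *args, **kwargs):
    """Run fn once and report wall time, CPU time, and peak Python memory."""
    tracemalloc.start()
    cpu0, wall0 = time.process_time(), time.perf_counter()
    result = fn(*args, **kwargs)
    cpu_s = time.process_time() - cpu0
    wall_s = time.perf_counter() - wall0
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return result, {"wall_s": wall_s, "cpu_s": cpu_s, "peak_kb": peak / 1024.0}
```

Calling `measure` around each inference invocation, iteration by iteration, yields the kind of CPU, memory, and execution-time curves plotted above.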

Additional Notes

  • Firmware Development:

    • In this project, we used a converted TFLite model (via xxd) embedded as a C header (model_data.h). No external AI libraries are imported in the firmware code.
    • The project's source file (Runner.c) then references model_data.h.
  • Simulation:

    • The Python simulation helps in validating the TFLite model performance before deploying it on hardware.
    • To analyze memory usage, CPU time, and inference latency, we used a Python script for each project.

Authors

Special Thanks to Ali Salesi.
