How to deploy your NN model on STM32MPU

Applicable for

STM32MP13x lines, STM32MP15x lines, STM32MP25x lines

1. Article purpose

This article gives the main steps and advice for deploying NN models on STM32MPU boards through the X-LINUX-AI expansion package. X-LINUX-AI is designed to be user-friendly and to facilitate NN model deployment on all STM32MPU targets with a common and coherent ecosystem.

2. Deploy NN model on STM32MP2x board

This part details the steps to follow to deploy an NN model on STM32MP2x boards. The following diagram is a quick visual aid to determine how to deploy a network using the X-LINUX-AI ecosystem. The different steps mentioned in this diagram are detailed below.

2.1. Which type of NN model is used

On STM32MP2x, the X-LINUX-AI ecosystem supports multiple types of NN models depending on the targeted computation engine. Neural network computation is available on the NPU, GPU and CPU for STM32MP2x targets, whereas only the CPU is available on STM32MP1x.

The only model type available to address high-performance AI applications on the NPU/GPU is:

NBG model: this is the precompiled NN model format that can be executed directly on the hardware. It spares the compilation time that would otherwise be spent during the first inference of the model. The common extension for this type of model is .nb

NBG models can be obtained in two different ways:

Using the STM32AI-MPU tool: a free offline compiler tool (running on the host computer) used to optimize and convert a neural network (NN) model so that it can be executed on STM32MP2x boards. Please refer to the dedicated article, which provides all the information necessary to convert a model to NBG format.

Information

The STM32AI-MPU tool supports TensorFlow Lite and ONNX models as inputs.

Using STM32Cube.ai Developer Cloud: an online solution to benchmark NN models directly on STM32 targets via cloud services. When the benchmark is run on an STM32MP2x board, the NBG model is automatically generated and can be downloaded.

Information

STM32Cube.ai Developer Cloud supports TensorFlow Lite, Keras and ONNX models as inputs.

The other supported model types only target the CPU or an external accelerator such as the Coral EdgeTPU:

TensorFlow Lite model: the common extension for this type of model is .tflite
Coral EdgeTPU model: this type of model is a derivative of the classic TensorFlow Lite model. The extension remains .tflite, but the model is pre-compiled for the EdgeTPU using a specific compiler. To go further with Coral models, please refer to the dedicated wiki article: How to compile model and run inference on Coral Edge TPU
ONNX model: the common extension for this type of model is .onnx
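As a rough illustration of the model types listed above, the following Python sketch maps a file extension to the model type and compute engine described in this section. The table and function names are hypothetical and are not part of X-LINUX-AI. Note that the extension alone cannot tell a standard TensorFlow Lite model from an EdgeTPU pre-compiled one, since both use .tflite.

```python
from pathlib import Path

# Hypothetical lookup table built from the model types listed in this article.
# It is only an illustration and is not part of X-LINUX-AI.
MP2_MODEL_TYPES = {
    ".nb": ("NBG", "NPU/GPU"),
    ".tflite": ("TensorFlow Lite or Coral EdgeTPU", "CPU or Coral EdgeTPU"),
    ".onnx": ("ONNX", "CPU"),
}


def describe_model(path):
    """Return (model type, compute engine), or None if a conversion is needed."""
    return MP2_MODEL_TYPES.get(Path(path).suffix.lower())
```

A return value of None means the file is not in a supported format as is and may need one of the conversions described in this article.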

If the model that you want to deploy is not in the above list, the model type is not supported as is and may need conversion. Here is a list of conversions from common AI framework formats to TensorFlow Lite or ONNX:

TensorFlow: a TensorFlow saved model with the .pb extension can easily be converted to a TensorFlow Lite model using the TensorFlow Lite converter.
Keras: a Keras .h5 file can also be converted to a TensorFlow Lite model using the TensorFlow Lite converter, as Keras has been part of TensorFlow since 2017.
PyTorch: for a typical PyTorch model (.pt), it is possible to directly export an ONNX model using the PyTorch built-in function torch.onnx.export. It is not possible to directly export a TensorFlow Lite model, but an ONNX model can be converted to a TensorFlow Lite model using packages such as onnx-tf or onnx2tf.
Ultralytics YOLO: Ultralytics provides a built-in function to export YOLOvX models to several formats such as ONNX and TensorFlow Lite.

Information

Note that it is also possible to convert an ONNX model to TensorFlow Lite using a package such as onnx-tf, or vice versa using tf2onnx.
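The TensorFlow and Keras conversions mentioned above can be sketched as follows. This is a minimal example meant to run on the host PC, assuming TensorFlow 2.x is installed. The function name and default output path are hypothetical. For a saved model, the converter expects the directory that contains the .pb file rather than the file itself.

```python
def convert_to_tflite(source, output_path="model.tflite"):
    """Convert a SavedModel directory or a Keras .h5 file to TensorFlow Lite."""
    import tensorflow as tf  # imported lazily: only needed on the host PC

    if str(source).endswith(".h5"):
        # Keras .h5 file: load the model first, then hand it to the converter.
        keras_model = tf.keras.models.load_model(source)
        converter = tf.lite.TFLiteConverter.from_keras_model(keras_model)
    else:
        # Otherwise assume a SavedModel directory (hypothetical convention).
        converter = tf.lite.TFLiteConverter.from_saved_model(str(source))

    tflite_bytes = converter.convert()
    with open(output_path, "wb") as f:
        f.write(tflite_bytes)
    return output_path
```

The resulting file is a float model unless quantization options are set on the converter.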

2.2. Which quantization type is used

The most important point is to determine whether the model to execute on target is quantized. Generally, common AI frameworks such as TensorFlow, ONNX and PyTorch use a 32-bit floating-point representation during the training phase of the model, which is optimized for modern GPUs and CPUs but not for embedded devices.

On STM32MP2 series boards, the GPU/NPU is a common IP whose behavior depends on the quantization scheme chosen during the model optimization process:

8-bit per-tensor: the recommended quantization scheme to achieve the best performance on STM32MP2 series boards. Most of the NN model layers in per-tensor mode are executed on the NPU; a minority of layers are executed on the GPU.
8-bit per-channel: not supported on the NPU. In this case, most of the NN model layers are executed on the GPU and only a few operations are executed on the NPU. Even if the model is quantized per-channel, it is still accelerated by the GPU, but with lower performance than per-tensor. Depending on the use case, this performance may be sufficient.
float-16: not supported on the NPU. In this case, most of the NN model layers are executed on the GPU and only a few operations are executed on the NPU. This type of quantization is not recommended for hardware acceleration on STM32MP2 series boards.
Non-quantized model (float-32): non-quantized models are not supported as is on the GPU/NPU; quantization is needed to get the best out of the IP.
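To make the difference between the per-tensor and per-channel schemes concrete, here is a small pure-Python sketch of symmetric 8-bit weight quantization. The weight values and function names are made up for illustration. The sketch only shows the arithmetic and does not describe how the NPU or GPU implements it.

```python
def symmetric_scale(values, qmax=127):
    """Scale that maps the largest absolute value onto qmax (symmetric int8)."""
    return max(abs(v) for v in values) / qmax


def quantize(values, scale, qmax=127):
    """Round each value to the nearest integer step and clamp to [-qmax, qmax]."""
    return [max(-qmax, min(qmax, round(v / scale))) for v in values]


# Toy weights of a layer with two output channels (values are made up).
weights = [[0.5, -1.27, 0.25], [0.02, -0.004, 0.0127]]

# Per-tensor: one scale shared by the whole tensor.
tensor_scale = symmetric_scale([w for channel in weights for w in channel])
per_tensor = [quantize(channel, tensor_scale) for channel in weights]

# Per-channel: one scale per output channel.
channel_scales = [symmetric_scale(channel) for channel in weights]
per_channel = [quantize(ch, s) for ch, s in zip(weights, channel_scales)]
```

With these toy values, the second channel becomes [2, 0, 1] with a single per-tensor scale and [127, -25, 81] with its own per-channel scale. Per-tensor uses one scale for the whole tensor, at the cost of resolution for channels with small values.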

To determine whether a model is quantized, the most convenient way is to use a tool such as Netron, which is a visualizer for neural network models. For each layer of the NN, Netron shows the data type (float32, int8, uint8, ...) as well as the quantization type and the quantization parameters. If the data type of the internal layers (excluding the input and output layers) is 8-bit or lower, the model is quantized.
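As a scripted complement to a visual check in Netron, the tensors of a TensorFlow Lite model can also be inspected on the host PC. The sketch below is a minimal example, assuming TensorFlow 2.x is installed. The function names are hypothetical. It counts tensors by data type and treats a tensor with one scale as per-tensor and a tensor with several scales as per-channel.

```python
def summarize_quantization(tensor_details):
    """Count tensors by data type and by quantization granularity.

    tensor_details follows the layout returned by the TensorFlow Lite
    Interpreter.get_tensor_details() method: a list of dicts with a "dtype"
    entry and a "quantization_parameters" entry holding a "scales" sequence.
    """
    summary = {"dtypes": {}, "per_tensor": 0, "per_channel": 0}
    for detail in tensor_details:
        dtype = detail["dtype"]
        name = getattr(dtype, "__name__", str(dtype))
        summary["dtypes"][name] = summary["dtypes"].get(name, 0) + 1
        scales = detail.get("quantization_parameters", {}).get("scales", ())
        if len(scales) == 1:
            summary["per_tensor"] += 1
        elif len(scales) > 1:
            summary["per_channel"] += 1
    return summary


def inspect_tflite(model_path):
    """Load a .tflite file on the host PC and summarize its tensors."""
    import tensorflow as tf  # imported lazily so the helper above has no dependency

    interpreter = tf.lite.Interpreter(model_path=model_path)
    return summarize_quantization(interpreter.get_tensor_details())
```

A model whose summary shows mostly int8 or uint8 tensors is quantized, and a non-zero per_channel count indicates that at least some weights use per-channel scales.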

TensorFlow Lite and ONNX models run on the CPU, so it is possible to run a non-quantized model, but with slow performance. Even for CPU execution, it is highly recommended to perform an 8-bit quantization. An 8-bit quantized model runs faster with, in most cases, an acceptable accuracy loss.

To quantize a model with post-training quantization, the TensorFlow Lite converter and the ONNX Runtime framework provide everything necessary to perform the quantization directly on the host PC. The documentation can be found on their respective websites.
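A post-training quantization with the TensorFlow Lite converter might look like the sketch below. It is meant to run on the host PC and assumes TensorFlow 2.x. The function name, parameter names and output path are hypothetical. The attribute used to request per-tensor weights is a private, experimental converter attribute, so treat it as an assumption and check it against the TensorFlow version in use.

```python
def quantize_saved_model_int8(saved_model_dir, calibration_samples,
                              output_path="model_int8.tflite"):
    """Post-training full-integer quantization with the TensorFlow Lite converter.

    calibration_samples: iterable of float32 NumPy arrays shaped like the model
    input (batch dimension included), taken from representative real data.
    """
    import tensorflow as tf  # lazy import: runs on the host PC only

    def representative_dataset():
        # The converter runs these samples to estimate activation ranges.
        for sample in calibration_samples:
            yield [sample]

    converter = tf.lite.TFLiteConverter.from_saved_model(saved_model_dir)
    converter.optimizations = [tf.lite.Optimize.DEFAULT]
    converter.representative_dataset = representative_dataset
    # Restrict the model to 8-bit integer kernels.
    converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
    converter.inference_input_type = tf.int8
    converter.inference_output_type = tf.int8
    # Assumption: private, experimental attribute that requests per-tensor
    # weights instead of the default per-channel scheme.
    converter._experimental_disable_per_channel = True

    with open(output_path, "wb") as f:
        f.write(converter.convert())
    return output_path
```

The quality of the quantized model depends heavily on the calibration samples, which should come from the same distribution as the data seen at inference time.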

To sum up the important information on quantization:

If the model to deploy on target is not quantized, it is necessary to quantize it (post-training quantization or quantization-aware training).
The quantization type is very important. To get the best performance out of the GPU/NPU IP, the model should be quantized in 8-bit per-tensor.
Once the TensorFlow Lite or ONNX model is quantized, it is necessary to convert it to NBG format using the STM32AI-MPU offline compiler tool or STM32Cube.ai Developer Cloud.

Information

Note that it is not possible to quantize a model directly on target.

2.3. Deploy the model on target

Once the model is in NBG format and optimized for embedded deployment, the next step is to benchmark it on target using the X-LINUX-AI unified benchmark to validate that the model behaves correctly. To do so, please refer to the dedicated article: How to benchmark your NN model on STM32MPU

To go further and develop an AI application based on this model using the TensorFlow Lite runtime or ONNX Runtime, please refer to the application example wiki articles: AI - Application examples

3. Deploy NN model on STM32MP1x board

This part details the steps to follow to deploy an NN model on STM32MP1x boards.

3.1. Which type of NN model is used

On STM32MP1x, the X-LINUX-AI ecosystem supports only three types of NN models:

TensorFlow Lite model: the common extension for this type of model is .tflite
Coral EdgeTPU model: this type of model is a derivative of the classic TensorFlow Lite model. The extension remains .tflite, but the model is pre-compiled for the EdgeTPU using a specific compiler. To go further with Coral models, please refer to the dedicated wiki article: How to compile model and run inference on Coral Edge TPU
ONNX model: the common extension for this type of model is .onnx

If the model that you want to deploy is not in the above list, the model type is not supported as is and may need conversion. Here is a list of conversions from common AI framework formats to TensorFlow Lite or ONNX:

TensorFlow: a TensorFlow saved model with the .pb extension can easily be converted to a TensorFlow Lite model using the TensorFlow Lite converter.
Keras: a Keras .h5 file can also be converted to a TensorFlow Lite model using the TensorFlow Lite converter, as Keras has been part of TensorFlow since 2017.
PyTorch: for a typical PyTorch model (.pt), it is possible to directly export an ONNX model using the PyTorch built-in function torch.onnx.export. It is not possible to directly export a TensorFlow Lite model, but an ONNX model can be converted to a TensorFlow Lite model using packages such as onnx-tf or onnx2tf.
Ultralytics YOLO: Ultralytics provides a built-in function to export YOLOvX models to several formats such as ONNX and TensorFlow Lite.

Information

Note that it is also possible to convert an ONNX model to TensorFlow Lite using a package such as onnx-tf, or vice versa using tf2onnx.
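The PyTorch export path mentioned above can be sketched as follows. This is a minimal example for the host PC, assuming PyTorch is installed and that the model takes a single tensor as input. The function name, the tensor names "input" and "output", and the default output path are hypothetical.

```python
def export_to_onnx(model, input_shape, output_path="model.onnx"):
    """Export a PyTorch nn.Module to ONNX using a random dummy input."""
    import torch  # lazy import: only needed on the host PC

    # Switch layers such as dropout and batch normalization to inference mode.
    model.eval()
    # The export traces the model with an example input of the right shape.
    dummy_input = torch.randn(*input_shape)
    torch.onnx.export(
        model,
        dummy_input,
        output_path,
        input_names=["input"],
        output_names=["output"],
    )
    return output_path
```

For an image classifier, a call could look like export_to_onnx(my_model, (1, 3, 224, 224)), where my_model and the shape stand for your own network and its input size.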

3.2. Which quantization type is used

The most important point is to determine whether the model to execute on target is quantized. Generally, common AI frameworks such as TensorFlow, ONNX and PyTorch use a 32-bit floating-point representation during the training phase of the model, which is optimized for modern GPUs and CPUs but not for embedded devices.

To determine whether a model is quantized, the most convenient way is to use a tool such as Netron, which is a visualizer for neural network models. For each layer of the NN, Netron shows the data type (float32, int8, uint8, ...) as well as the quantization type and the quantization parameters. If the data type of the internal layers (excluding the input and output layers) is 8-bit or lower, the model is quantized.
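For ONNX models, a quick scripted heuristic can complement the visual check in Netron: quantized ONNX graphs contain dedicated quantization operators. The sketch below is a minimal example for the host PC, assuming the onnx Python package is installed. The function names and the selection of operators are illustrative and not exhaustive.

```python
# Operator types used by quantized ONNX graphs (QDQ and QOperator styles).
QUANT_OPS = {"QuantizeLinear", "DequantizeLinear", "QLinearConv", "QLinearMatMul"}


def looks_quantized(op_types):
    """Return True if any operator type in the graph is a quantization operator."""
    return any(op in QUANT_OPS for op in op_types)


def onnx_looks_quantized(model_path):
    """Load an ONNX file on the host PC and check its operator types."""
    import onnx  # lazy import so looks_quantized() has no dependency

    model = onnx.load(model_path)
    return looks_quantized(node.op_type for node in model.graph.node)
```

A result of False suggests a float model, but Netron remains the reference to confirm the data types layer by layer.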

Float-32 models can be run on the CPU of STM32MP1x using TensorFlow Lite or ONNX Runtime, but performance will be very slow. It is highly recommended to perform an 8-bit quantization. An 8-bit quantized model runs faster with, in most cases, an acceptable accuracy loss.

To quantize a model with post-training quantization, the TensorFlow Lite converter and the ONNX Runtime framework provide everything necessary to perform the quantization directly on the host PC. The documentation can be found on their respective websites.
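For an ONNX model, the post-training quantization can be sketched with the ONNX Runtime quantization module. This is a minimal example for the host PC, assuming the onnxruntime package is installed. The function name, the BatchReader class and the parameter names are hypothetical. The input tensor name has to match the one of your model, which can be read in Netron.

```python
def quantize_onnx_int8(model_path, calibration_batches, input_name,
                       output_path="model_int8.onnx"):
    """Static post-training quantization of an ONNX model with ONNX Runtime.

    calibration_batches: iterable of float32 NumPy arrays shaped like the input.
    input_name: name of the model input tensor.
    """
    from onnxruntime.quantization import (
        CalibrationDataReader, QuantFormat, QuantType, quantize_static,
    )

    class BatchReader(CalibrationDataReader):
        """Feeds calibration batches one by one, then returns None."""

        def __init__(self, batches):
            self._batches = iter(batches)

        def get_next(self):
            batch = next(self._batches, None)
            return None if batch is None else {input_name: batch}

    quantize_static(
        model_path,
        output_path,
        BatchReader(calibration_batches),
        quant_format=QuantFormat.QDQ,
        per_channel=False,  # one scale per tensor
        activation_type=QuantType.QInt8,
        weight_type=QuantType.QInt8,
    )
    return output_path
```

As with any post-training quantization, the calibration batches should be representative of the data the model will see on target.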

Information

Note that it is not possible to quantize a model directly on target.

3.3. Deploy the model on target

Once the model is in TensorFlow Lite or ONNX format and optimized for embedded deployment, the next step is to benchmark it on target using the X-LINUX-AI unified benchmark to validate that the model works correctly. To do so, please refer to the dedicated article: How to benchmark your NN model on STM32MPU

To go further and develop an AI application based on this model using the TensorFlow Lite runtime or ONNX Runtime, please refer to the application example wiki articles: AI - Application examples
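As a starting point before reading the application examples, a single inference with the TensorFlow Lite runtime can be sketched as below. This is a minimal example that assumes the tflite_runtime Python package and NumPy are available on the board image. The function name is hypothetical, and pre-processing and post-processing are left out.

```python
def run_tflite_once(model_path, input_array):
    """Run one inference with the TensorFlow Lite runtime and return the first output."""
    from tflite_runtime.interpreter import Interpreter  # lazy import: target-side package

    interpreter = Interpreter(model_path=model_path)
    interpreter.allocate_tensors()
    input_detail = interpreter.get_input_details()[0]
    output_detail = interpreter.get_output_details()[0]

    # input_array must already match the shape and data type expected by the model.
    interpreter.set_tensor(input_detail["index"], input_array)
    interpreter.invoke()
    return interpreter.get_tensor(output_detail["index"])
```

The expected input shape and data type can be read from the "shape" and "dtype" entries of the input details, which is useful when feeding a quantized model.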