How to deploy your NN model on STM32MPU

Applicable for STM32MP13x lines, STM32MP15x lines, STM32MP25x lines


1. Article purpose

The purpose of this article is to give the main steps and advice on how to deploy NN models on STM32MPU boards through the X-LINUX-AI expansion package. X-LINUX-AI is designed to be user-friendly and to facilitate NN model deployment on all STM32MPU targets with a common and coherent ecosystem.

2. Deploying an NN model on STM32MP2x boards

This part details the steps to follow to deploy an NN model on STM32MP2x boards. The following diagram is a quick visual aid to determine how to deploy a network using the X-LINUX-AI ecosystem. The different steps mentioned in this diagram are detailed below.

How to deploy NN model using X-LINUX-AI ecosystem

2.1. Defining the type of NN model

On STM32MP2x, the X-LINUX-AI ecosystem supports multiple types of NN models, depending on the computation engine targeted. Neural network computation is available on the NPU, GPU, and CPU on STM32MP2x targets, whereas only the CPU is available on STM32MP1x.

The only model type available to address high-performance AI applications on the NPU/GPU is the NBG model:

  • This is the precompiled NN model format that can be executed directly on the hardware. This spares the pre-compilation time during the first inference of the model. The common extension for this type of model is .nb.

NBG models are obtained in two different ways:

  • Using the STM32AI-MPU tool, which is a free offline compiler tool (running on the host computer) used to optimize and convert a neural network (NN) model to be executed on the STM32MP2x boards. Refer to the dedicated article providing all the information required to convert a model to NBG format.
Information
The STM32AI-MPU tool supports TensorFlow Lite and ONNX models as inputs.
  • Using STM32Cube.AI Developer Cloud, which is an online solution to benchmark the NN model directly on STM32 targets via cloud services. When the benchmark is run on an STM32MP2x board, the NBG model is automatically generated and can be downloaded.
Information
STM32Cube.AI Developer Cloud supports TensorFlow Lite, Keras, and ONNX models as inputs.

Other supported model types target only the CPU or an external accelerator, such as the Coral EdgeTPU:

  • The TensorFlow Lite model: the common extension for this type of model is .tflite.
  • The Coral EdgeTPU model: this type of model is a derivative of the classic TensorFlow Lite model. The extension remains .tflite, but the model is precompiled for the EdgeTPU using a specific compiler. To go further with Coral models, refer to the dedicated wiki article: How to compile model and run inference on Coral Edge TPU.
  • The ONNX model: the common extension for this type of model is .onnx.

If the above list does not contain the model that you want to deploy, it means that the model type is not supported as is and may need conversion. The list below covers common AI framework formats and how to convert them to the TensorFlow Lite or ONNX types:

  • TensorFlow: for a TensorFlow saved model with the .pb extension, the conversion to a TensorFlow Lite model can easily be done using the TensorFlow Lite converter (see the conversion sketch below).
  • Keras: for a Keras .h5 file, the conversion to a TensorFlow Lite model can also be done using the TensorFlow Lite converter, as Keras has been part of TensorFlow since 2017.
  • PyTorch: for a typical PyTorch model (.pt), it is possible to directly export an ONNX model using the PyTorch built-in function torch.onnx.export. It is not possible to directly export a TensorFlow Lite model, but it is possible to convert the ONNX model to a TensorFlow Lite model using packages like onnx-tf or onnx2tf.
  • Ultralytics YOLO: Ultralytics provides a built-in function to export YOLOvX models to several formats, such as ONNX and TensorFlow Lite.
Information
It is also possible to convert an ONNX model to TensorFlow Lite using packages like onnx-tf, or vice versa using tf2onnx.
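
As an illustration, the Python sketch below shows how a TensorFlow saved model or a Keras .h5 file could be converted with the TensorFlow Lite converter. The file names (my_saved_model, my_model.h5, my_model.tflite) are placeholders used only to make the example self-contained.

```python
import tensorflow as tf

# Convert a TensorFlow SavedModel directory (contains the .pb file).
# "my_saved_model" is a placeholder path.
converter = tf.lite.TFLiteConverter.from_saved_model("my_saved_model")
tflite_model = converter.convert()

# Alternatively, convert a Keras .h5 model.
keras_model = tf.keras.models.load_model("my_model.h5")
converter = tf.lite.TFLiteConverter.from_keras_model(keras_model)
tflite_model = converter.convert()

# Save the resulting TensorFlow Lite model.
with open("my_model.tflite", "wb") as f:
    f.write(tflite_model)
```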

2.2. Defining the type of quantization

The most important point is to determine whether the model to execute on the target is quantized or not. Generally, common AI frameworks such as TensorFlow, ONNX, and PyTorch use a 32-bit floating-point representation during the training phase of the model, which is optimized for modern GPUs and CPUs but not for embedded devices.

On STM32MP2 series' boards, the GPU/NPU is a common IP whose behavior depends on the quantization scheme chosen during the model optimization process:

  • 8-bit per-tensor is the recommended quantization scheme to achieve the best performance on STM32MP2 series' boards. In per-tensor mode, most of the NN model layers are executed on the NPU, and only a minority of layers are executed on the GPU.
  • 8-bit per-channel is not supported on the NPU. In this case, most of the NN model layers are executed on the GPU, and only a few operations are executed on the NPU. It is important to mention that, even if the model is quantized per-channel, it is still accelerated on the GPU, with lower performance than per-tensor. However, depending on the use case, this performance may be sufficient.
  • float-16 is not supported on the NPU. In this case, most of the NN model layers are executed on the GPU, and only a few operations are executed on the NPU. This type of quantization is not recommended for hardware acceleration on STM32MP2 series' boards.
  • Non-quantized (float-32) models are not supported as is on the GPU/NPU. Quantization is needed to get the best out of the IP.

To determine whether a model is quantized, the most convenient way is to use a tool like Netron, which is a visualizer for neural network models. For each layer of the NN, Netron shows the data type (float32, int8, uint8, and so on) as well as the quantization type and the quantization parameters. If the data types of the internal layers (excluding the input and output layers) are 8-bit or lower, the model is quantized.
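
As a complement to Netron, the tensor data types of a TensorFlow Lite model can also be listed programmatically with the TensorFlow Lite interpreter. The sketch below is a minimal example; my_model.tflite is a placeholder file name.

```python
import tensorflow as tf

# Load the TensorFlow Lite model and inspect its tensors.
interpreter = tf.lite.Interpreter(model_path="my_model.tflite")
interpreter.allocate_tensors()

for detail in interpreter.get_tensor_details():
    # A quantized model mainly contains int8/uint8 tensors with
    # non-empty quantization parameters (scale, zero point).
    print(detail["name"], detail["dtype"], detail["quantization"])
```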

TensorFlow Lite and ONNX models run on the CPU, so it is possible to run a non-quantized model, but with low performance. Even for CPU execution, it is highly recommended to perform an 8-bit quantization: an 8-bit quantized model runs faster with, in most cases, an acceptable accuracy loss.

To quantize a model with post-training quantization, the TensorFlow Lite converter and ONNX Runtime frameworks provide all the necessary elements to perform such quantization directly on the host PC. The documentation can be found on their respective websites, and a minimal sketch using the TensorFlow Lite converter is given below.
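
As an illustration, the sketch below shows one possible full-integer post-training quantization with the TensorFlow Lite converter. The model path, input shape, and random representative dataset are placeholders: real calibration samples matching the model inputs must be used. Note that the TensorFlow Lite converter generally quantizes weights per-channel by default; refer to the framework and STM32 tooling documentation if per-tensor quantization is required.

```python
import numpy as np
import tensorflow as tf

# Representative dataset used to calibrate activation ranges.
# Random data is used here for illustration only: replace it with
# real samples shaped like the model input.
def representative_dataset():
    for _ in range(100):
        yield [np.random.rand(1, 224, 224, 3).astype(np.float32)]

converter = tf.lite.TFLiteConverter.from_saved_model("my_saved_model")
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.representative_dataset = representative_dataset
# Force full 8-bit integer quantization, including inputs and outputs.
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
converter.inference_input_type = tf.uint8
converter.inference_output_type = tf.uint8

with open("my_model_quant.tflite", "wb") as f:
    f.write(converter.convert())
```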

To summarize the main points on quantization:

  • If the model to deploy on the target is not quantized, it is necessary to perform a quantization (post-training or quantization-aware training).
  • The quantization type is very important. To get the best performance out of the GPU/NPU IP, the model should be quantized 8-bit per-tensor.
  • Once the TensorFlow Lite or ONNX model is quantized, it is necessary to convert it to the NBG format using the STM32AI-MPU offline compiler tool or STM32Cube.AI Developer Cloud.
Information
A model cannot be quantized directly on the target.

2.3. Deploying the model on the target

Once the model is in NBG format and optimized for embedded deployment, the next step is to run a benchmark on the target using the X-LINUX-AI unified benchmark to validate the correct behavior of the model. To do so, refer to the dedicated article: How to benchmark your NN model on STM32MPU.

To go further and develop an AI application based on this model using the TensorFlow Lite runtime or ONNX Runtime, refer to the application example wiki articles: AI - Application examples.

3. Deploying an NN model on STM32MP1x boards

This part details the steps to follow to deploy an NN model on STM32MP1x boards.

3.1. Which type of NN model is used

On STM32MP1x, the X-LINUX-AI ecosystem supports only three types of NN models:

  • The TensorFlow Lite model: the common extension for this type of model is .tflite.
  • The Coral EdgeTPU model: this type of model is a derivative of the classic TensorFlow Lite model. The extension remains .tflite, but the model is precompiled for the EdgeTPU using a specific compiler. To go further with Coral models, refer to the dedicated wiki article: How to compile model and run inference on Coral Edge TPU.
  • The ONNX model: the common extension for this type of model is .onnx.

If the model that you want to deploy is not in the above list, it means that the model type is not supported as is and may need conversion. The list below covers common AI framework formats and how to convert them to the TensorFlow Lite or ONNX types:

  • TensorFlow: for a TensorFlow saved model with the .pb extension, the conversion to a TensorFlow Lite model can easily be done using the TensorFlow Lite converter.
  • Keras: for a Keras .h5 file, the conversion to a TensorFlow Lite model can also be done using the TensorFlow Lite converter, as Keras has been part of TensorFlow since 2017.
  • PyTorch: for a typical PyTorch model (.pt), it is possible to directly export an ONNX model using the PyTorch built-in function torch.onnx.export (see the export sketch below). It is not possible to directly export a TensorFlow Lite model, but it is possible to convert the ONNX model to a TensorFlow Lite model using packages like onnx-tf or onnx2tf.
  • Ultralytics YOLO: Ultralytics provides a built-in function to export YOLOvX models to several formats, such as ONNX and TensorFlow Lite.
Information
Note that it is also possible to convert an ONNX model to TensorFlow Lite using packages like onnx-tf, or vice versa using tf2onnx.
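
As an illustration, the sketch below exports a PyTorch model to ONNX using torch.onnx.export. The torchvision MobileNetV2 model, the input shape, and the output file name are placeholders used only to make the example self-contained.

```python
import torch
import torchvision

# Any torch.nn.Module can be exported; a torchvision model is used
# here purely as an illustration.
model = torchvision.models.mobilenet_v2(weights=None)
model.eval()

# Dummy input defining the expected input shape (batch, channels, H, W).
dummy_input = torch.randn(1, 3, 224, 224)

torch.onnx.export(
    model,
    dummy_input,
    "mobilenet_v2.onnx",  # output file name (placeholder)
    input_names=["input"],
    output_names=["output"],
    opset_version=13,
)
```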

3.2. Which quantization type is used

The most important point is to determine whether the model to execute on the target is quantized or not. Generally, common AI frameworks such as TensorFlow, ONNX, and PyTorch use a 32-bit floating-point representation during the training phase of the model, which is optimized for modern GPUs and CPUs but not for embedded devices.

To determine whether a model is quantized, the most convenient way is to use a tool like Netron, which is a visualizer for neural network models. For each layer of the NN, Netron shows the data type (float32, int8, uint8, and so on) as well as the quantization type and the quantization parameters. If the data types of the internal layers (excluding the input and output layers) are 8-bit or lower, the model is quantized.

Float-32 models can be run on the CPU of STM32MP1x boards using TensorFlow Lite or ONNX Runtime, but performance will be very low. It is highly recommended to perform an 8-bit quantization: an 8-bit quantized model runs faster with, in most cases, an acceptable accuracy loss.

To quantize a model with post-training quantization, the TensorFlow Lite converter and ONNX Runtime frameworks provide all the necessary elements to perform such quantization directly on the host PC. The documentation can be found on their respective websites, and a minimal sketch using ONNX Runtime is given below.
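
As an illustration, the sketch below shows one possible post-training static quantization with ONNX Runtime. The model file names, the input tensor name (input), and the random calibration data are placeholders: real calibration samples matching the model inputs must be used.

```python
import numpy as np
from onnxruntime.quantization import (
    CalibrationDataReader,
    QuantType,
    quantize_static,
)

# Calibration reader feeding representative samples to ONNX Runtime.
# Random data is used here for illustration only.
class RandomDataReader(CalibrationDataReader):
    def __init__(self, input_name="input", count=100):
        self.samples = iter(
            [{input_name: np.random.rand(1, 3, 224, 224).astype(np.float32)}
             for _ in range(count)]
        )

    def get_next(self):
        return next(self.samples, None)

quantize_static(
    "my_model.onnx",        # float-32 input model (placeholder name)
    "my_model_quant.onnx",  # quantized output model
    RandomDataReader(),
    weight_type=QuantType.QInt8,
    activation_type=QuantType.QUInt8,
)
```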

Information
Note that it is not possible to quantize a model directly on the target.

3.3. Deploying the model on the target

Once the model is in TensorFlow Lite or ONNX format and optimized for embedded deployment, the next step is to run a benchmark on the target using the X-LINUX-AI unified benchmark to validate the correct behavior of the model. To do so, refer to the dedicated article: How to benchmark your NN model on STM32MPU.

To go further and develop an AI application based on this model using the TensorFlow Lite runtime or ONNX Runtime, refer to the application example wiki articles: AI - Application examples.