How to deploy your NN model on STM32MPU

Applicable for STM32MP13x lines, STM32MP15x lines, STM32MP25x lines


1. Article purpose

This article gives the main steps and advice for deploying neural network (NN) models on STM32MPU boards through the X-LINUX-AI expansion package. X-LINUX-AI is designed to be user-friendly and to facilitate NN model deployment on all STM32MPU targets with a common and coherent ecosystem.

2. Deploying a NN model on STM32MP2x boards

This part details the steps to follow in order to deploy a NN model on STM32MP25x lines. The following diagram is a quick visual aid to determine how to deploy a network using the X-LINUX-AI ecosystem. The steps shown in this diagram are detailed below.

Figure: How to deploy a NN model using the X-LINUX-AI ecosystem

2.1. Defining the type of NN model

On STM32MP2x, the X-LINUX-AI ecosystem supports multiple types of NN models, depending on the targeted computation engine. On STM32MP2x targets, neural network computation is available on the NPU, GPU, and CPU, whereas on STM32MP1x only the CPU is available.

The only model type available to address high performance AI applications on NPU/GPU is the NBG model:

  • This is the precompiled NN model format that can be executed directly on the hardware. This spares the pre-compilation time during the first inference of the model. The common extension for this type of model is .nb.

NBG models are obtained in two different ways:

  • Using the ST Edge AI Core, which is a free offline compiler tool (running on the host computer) used to optimize and convert a neural network (NN) model to be executed on the STM32MP2x boards. Refer to the dedicated article providing all the information required to convert a model to NBG format.
    Information: The ST Edge AI Core component for STM32MP2x boards supports TensorFlow™ Lite and ONNX™ models as inputs.
  • Using ST Edge AI Developer Cloud, which is an online solution to benchmark the NN model directly on STM32 targets via cloud services. When the benchmark is run on the STM32MP2x board, the NBG model is automatically generated and can be downloaded.
    Information: ST Edge AI Developer Cloud supports TensorFlow™ Lite, Keras and ONNX™ models as inputs.

Other supported model types target only the CPU or an external accelerator, such as the Coral Edge TPU™:

  • The TensorFlow™ Lite model: the common extension for this type of model is .tflite.
  • The Coral Edge TPU™ model: this type of model is a derivative of the classic TensorFlow™ Lite model. The extension of the model remains .tflite, but the model is pre-compiled for the Edge TPU™ using a specific compiler. To go further with Coral models, refer to the dedicated wiki article: How to compile model and run inference on Coral Edge TPU™
  • The ONNX™ model: the common extension for this type of model is .onnx.

If the above list does not contain the model that you want to deploy, the model type is not supported as is and may need conversion. The list below gives the usual conversion routes from common AI framework formats to TensorFlow™ Lite or ONNX™:

  • TensorFlow™: a TensorFlow™ saved model with the .pb extension can easily be converted to a TensorFlow™ Lite model using the TensorFlow™ Lite converter.
  • Keras: a Keras .h5 file can also be converted to a TensorFlow™ Lite model using the TensorFlow™ Lite converter. Keras has been part of TensorFlow™ since 2017.
  • PyTorch™: a typical PyTorch™ .pt model can be exported directly to ONNX™ using the PyTorch™ built-in function torch.onnx.export. It is not possible to directly export a TensorFlow™ Lite model, but the ONNX™ model can be converted to a TensorFlow™ Lite model using packages like onnx-tf or onnx2tf.
  • Ultralytics YOLO: Ultralytics provides a built-in function to export YOLOvx models to several formats, such as ONNX™ and TensorFlow™ Lite.

Information: It is also possible to convert an ONNX™ model to TensorFlow™ Lite using packages like onnx-tf, or vice versa using tf2onnx.
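
As an illustration of the TensorFlow™ Lite converter route, the following is a minimal sketch meant to run on the host PC, assuming the tensorflow Python package is installed there. The function names source_kind and convert_to_tflite and the default output file name are hypothetical; only the tf.lite.TFLiteConverter and tf.keras.models.load_model calls come from the TensorFlow API. The result is a float model: quantization is a separate step.

```python
from pathlib import Path


def source_kind(source: str) -> str:
    """Guess the input type from the path: Keras .h5 file or SavedModel directory."""
    return "keras" if Path(source).suffix == ".h5" else "saved_model"


def convert_to_tflite(source: str, output: str = "model.tflite") -> Path:
    """Convert a SavedModel directory or a Keras .h5 file to a float .tflite file."""
    import tensorflow as tf  # imported lazily: only needed on the host PC

    if source_kind(source) == "keras":
        # Keras .h5 file: load the model, then convert it from memory
        model = tf.keras.models.load_model(source)
        converter = tf.lite.TFLiteConverter.from_keras_model(model)
    else:
        # SavedModel: pass the directory that contains saved_model.pb
        converter = tf.lite.TFLiteConverter.from_saved_model(source)

    out = Path(output)
    out.write_bytes(converter.convert())  # convert() returns the model as bytes
    return out
```

A call such as convert_to_tflite("my_saved_model_dir") would then write model.tflite in the current directory; the directory name is only an example.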

2.2. Defining the type of quantization

The most important point is to determine whether the model to execute on target is quantized or not. Generally, common AI frameworks like TensorFlow™, ONNX™, and PyTorch™ use a 32-bit floating point representation during the training phase of the model, which is optimized for modern GPUs and CPUs, but not for embedded devices.

On STM32MP2 series boards, the GPU/NPU is a common IP whose behavior depends on the quantization scheme chosen during the model optimization process:

  • 8-bit per-tensor is the recommended quantization scheme to achieve the best performance on STM32MP2 series boards. Most of the NN model layers in per-tensor mode are executed on the NPU. A minority of layers are executed on the GPU.
  • 8-bit per-channel is not supported on the NPU. In this case, most of the NN model layers are executed on the GPU, and only a few operations are executed on the NPU. A per-channel model is therefore still accelerated, using the GPU, but with lower performance than a per-tensor model. Depending on the use case, this performance may be sufficient.
  • float-16 is not supported on the NPU. In this case, most of the NN model layers are executed on the GPU, and only a few operations are executed on the NPU. This type of quantization is not recommended for hardware acceleration on STM32MP2 series boards.
  • Non-quantized (float-32) models are not supported as is on the GPU/NPU. Quantization is needed to get the best out of the IP.
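
To make the difference between the per-tensor and per-channel schemes concrete, here is a small arithmetic sketch in plain Python. It assumes symmetric int8 quantization (scale = largest magnitude / 127) and uses hypothetical weight values for a layer with two output channels; it only illustrates how the scales are computed and says nothing about how the NPU or GPU processes them internally.

```python
def symmetric_scale(values, qmax=127):
    """Scale that maps the largest magnitude in `values` to qmax."""
    return max(abs(v) for v in values) / qmax


def quantize(values, scale, qmax=127):
    """Round each value to the nearest integer step and clamp to [-qmax, qmax]."""
    return [max(-qmax, min(qmax, round(v / scale))) for v in values]


# Hypothetical weights for a layer with two output channels
weights = [
    [0.50, -1.27, 0.80],     # channel 0: large dynamic range
    [0.012, -0.020, 0.004],  # channel 1: small dynamic range
]

# Per-tensor: a single scale shared by every channel of the tensor
flat = [w for channel in weights for w in channel]
per_tensor_scale = symmetric_scale(flat)
per_tensor_q = [quantize(ch, per_tensor_scale) for ch in weights]

# Per-channel: one scale per output channel
per_channel_scales = [symmetric_scale(ch) for ch in weights]
per_channel_q = [quantize(ch, s) for ch, s in zip(weights, per_channel_scales)]

print(per_tensor_q[1])   # channel 1 with the shared scale: [1, -2, 0]
print(per_channel_q[1])  # channel 1 with its own scale: [76, -127, 25]
```

With a single shared scale, the small-range channel is squeezed into a handful of integer steps, whereas its own scale uses the full int8 range. This is the precision trade-off between the two schemes; on STM32MP2x it has to be weighed against the fact that only per-tensor models run mostly on the NPU.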

To determine if a model is quantized, the most convenient way is to use a tool like Netron, which is a visualizer for neural network models. For each layer of the NN, it shows the data type (float32, int8, uint8, ...) as well as the quantization type and the quantization parameters. If the data type of the internal layers (all layers except the input and output layers) is 8-bit or lower, the model is quantized.
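
For a TensorFlow™ Lite model, a similar check can be scripted on the host PC. The sketch below assumes the tensorflow Python package is installed and relies on the quantization_parameters entry returned by tf.lite.Interpreter.get_tensor_details(); the helper names quantization_scheme, summarize and inspect_tflite are hypothetical. The classification by number of scales is a heuristic: a per-channel tensor with a single channel is reported as per-tensor.

```python
def quantization_scheme(tensor_detail: dict) -> str:
    """Classify one entry of tf.lite.Interpreter.get_tensor_details()."""
    scales = tensor_detail["quantization_parameters"]["scales"]
    if len(scales) == 0:
        return "none"  # no scale: the tensor is not quantized
    return "per-tensor" if len(scales) == 1 else "per-channel"


def summarize(tensor_details) -> dict:
    """Count how many tensors fall into each quantization scheme."""
    counts = {"none": 0, "per-tensor": 0, "per-channel": 0}
    for detail in tensor_details:
        counts[quantization_scheme(detail)] += 1
    return counts


def inspect_tflite(model_path: str) -> dict:
    """Load a .tflite file and summarize the quantization of its tensors."""
    import tensorflow as tf  # imported lazily: only needed on the host PC

    interpreter = tf.lite.Interpreter(model_path=model_path)
    return summarize(interpreter.get_tensor_details())
```

A model whose summary contains mostly "none" entries is a float model, while a large "per-channel" count indicates that the weights were quantized per channel.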

TensorFlow™ Lite and ONNX™ models run on the CPU, so a non-quantized model can be executed, but with low performance. Even for CPU execution, 8-bit quantization is highly recommended: an 8-bit quantized model runs faster with, in most cases, an acceptable accuracy loss.

To quantize a model with post-training quantization, the TensorFlow™ Lite converter and the ONNX™ Runtime framework provide all the necessary elements to perform the quantization directly on the host PC. The documentation can be found on their respective websites.
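
As a hedged illustration of post-training quantization with the TensorFlow™ Lite converter, the sketch below performs full-integer 8-bit quantization of a saved model on the host PC. It assumes the tensorflow Python package is installed and that samples is a small list of float32 NumPy arrays shaped like the model input (batch dimension included). The function names and the output file name are hypothetical. The _experimental_disable_per_channel attribute is a private, experimental converter setting used here to request per-tensor weights; it may change between TensorFlow releases, so check the converter documentation for the installed version.

```python
def make_representative_dataset(samples):
    """Wrap calibration samples in the generator form expected by the converter."""
    def generator():
        for sample in samples:
            yield [sample]  # one list entry per model input
    return generator


def quantize_int8(saved_model_dir, samples, output="model_int8.tflite", per_tensor=True):
    """Post-training full-integer quantization of a SavedModel (host PC only)."""
    import tensorflow as tf  # imported lazily: only needed on the host PC

    converter = tf.lite.TFLiteConverter.from_saved_model(saved_model_dir)
    converter.optimizations = [tf.lite.Optimize.DEFAULT]
    converter.representative_dataset = make_representative_dataset(samples)
    # Restrict the model to int8 kernels, including its inputs and outputs
    converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
    converter.inference_input_type = tf.int8
    converter.inference_output_type = tf.int8
    if per_tensor:
        # Private, experimental attribute (assumption: present in the installed release)
        converter._experimental_disable_per_channel = True

    with open(output, "wb") as f:
        f.write(converter.convert())
    return output
```

The calibration samples should be representative of real input data, since the converter uses them to estimate the activation ranges.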

To summarize the main points on quantization:

  • If the model to deploy on target is not quantized, it is necessary to quantize it (post-training quantization or quantization-aware training).
  • The quantization type is very important. To get the best performance from the GPU/NPU IP, the model should be quantized in 8-bit per-tensor.
  • Once the TensorFlow™ Lite or ONNX™ model is quantized, it is necessary to convert it to NBG format using the ST Edge AI Core offline compiler tool or ST Edge AI Developer Cloud.

Information: A model cannot be quantized directly on target.

2.3. Deploy the model on target

Once the model is in NBG format and optimized for embedded deployment, the next step is to perform a benchmark on target using the X-LINUX-AI unified benchmark. This validates the correct behavior of the model. To do so, refer to the dedicated article: How to benchmark your NN model on STM32MPU

To go further and develop an AI application based on this model using the TensorFlow™ Lite runtime or ONNX™ Runtime, refer to the application example wiki articles: AI - Application examples

3. Deploy NN model on STM32MP1x board

This part details the steps to follow to deploy the NN model on the STM32MP1x board.

3.1. Which type of NN model is used

On STM32MP1x, the X-LINUX-AI ecosystem supports only three types of NN models:

  • TensorFlow™ Lite model: the common extension for this type of model is .tflite.
  • Coral Edge TPU™ model: this type of model is a derivative of the classic TensorFlow™ Lite model. The extension of the model remains .tflite, but the model is pre-compiled for the Edge TPU™ using a specific compiler. To go further with Coral models, refer to the dedicated wiki article: How to compile model and run inference on Coral Edge TPU™
  • ONNX™ model: the common extension for this type of model is .onnx.

If the model that you want to deploy is not in the above list, the model type is not supported as is and may need conversion. The list below gives the usual conversion routes from common AI framework formats to TensorFlow™ Lite or ONNX™:

  • TensorFlow™: a TensorFlow™ saved model with the .pb extension can easily be converted to a TensorFlow™ Lite model using the TensorFlow™ Lite converter.
  • Keras: a Keras .h5 file can also be converted to a TensorFlow™ Lite model using the TensorFlow™ Lite converter, as Keras has been part of TensorFlow™ since 2017.
  • PyTorch™: a typical PyTorch™ .pt model can be exported directly to ONNX™ using the PyTorch™ built-in function torch.onnx.export. It is not possible to directly export a TensorFlow™ Lite model, but the ONNX™ model can be converted to a TensorFlow™ Lite model using packages like onnx-tf or onnx2tf.
  • Ultralytics YOLO: Ultralytics provides a built-in function to export YOLOvx models to several formats, such as ONNX™ and TensorFlow™ Lite.

Information: It is also possible to convert an ONNX™ model to TensorFlow™ Lite using packages like onnx-tf, or vice versa using tf2onnx.
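
For the PyTorch™ route, the export to ONNX™ can be sketched as follows on the host PC, assuming the torch Python package is installed and that model is an already instantiated torch.nn.Module with its weights loaded (a .pt file may contain only a state dictionary, in which case the model class must be built first). The helper names, the tensor names "input" and "output", and the default opset 13 are assumptions chosen for illustration.

```python
from pathlib import Path


def onnx_path_for(pt_path: str) -> str:
    """Derive the output file name: same name, .onnx extension."""
    return str(Path(pt_path).with_suffix(".onnx"))


def export_to_onnx(model, example_input, pt_path: str, opset: int = 13) -> str:
    """Export a loaded PyTorch model to ONNX next to its .pt file."""
    import torch  # imported lazily: only needed on the host PC

    output = onnx_path_for(pt_path)
    model.eval()  # export in inference mode (affects dropout, batch norm)
    torch.onnx.export(
        model,
        (example_input,),          # example input used to trace the graph
        output,
        input_names=["input"],     # hypothetical tensor names
        output_names=["output"],
        opset_version=opset,
    )
    return output
```

The example input only needs the right shape and data type, for instance a random tensor matching the input size expected by the network.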

3.2. Which quantization type is used

The most important point is to determine whether the model to execute on target is quantized or not. Generally, common AI frameworks like TensorFlow™, ONNX™, and PyTorch™ use a 32-bit floating point representation during the training phase of the model, which is optimized for modern GPUs and CPUs, but not for embedded devices.

To determine if a model is quantized, the most convenient way is to use a tool like Netron, which is a visualizer for neural network models. For each layer of the NN, it shows the data type (float32, int8, uint8, ...) as well as the quantization type and the quantization parameters. If the data type of the internal layers (all layers except the input and output layers) is 8-bit or lower, the model is quantized.

Float-32 models can run on the CPU of STM32MP1x using TensorFlow™ Lite or ONNX™ Runtime, but inference is very slow. It is highly recommended to perform 8-bit quantization: an 8-bit quantized model runs faster with, in most cases, an acceptable accuracy loss.

To quantize a model with post-training quantization, the TensorFlow™ Lite converter and the ONNX™ Runtime framework provide all the necessary elements to perform the quantization directly on the host PC. The documentation can be found on their respective websites.
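
As a sketch of the ONNX™ Runtime route, the following shows static post-training quantization to 8 bits on the host PC. It assumes the onnxruntime Python package is installed, that samples is a small list of float32 NumPy arrays shaped like the model input, and that the model has a single input whose name is passed as input_name ("input" is only a placeholder). The helper names calibration_feeds and quantize_onnx_int8 are hypothetical; quantize_static, CalibrationDataReader and QuantType come from onnxruntime.quantization.

```python
def calibration_feeds(samples, input_name="input"):
    """Build one feed dictionary per calibration sample."""
    return [{input_name: sample} for sample in samples]


def quantize_onnx_int8(model_fp32, model_int8, samples, input_name="input"):
    """Static post-training quantization of an ONNX model (host PC only)."""
    from onnxruntime.quantization import (  # imported lazily: host PC only
        CalibrationDataReader,
        QuantType,
        quantize_static,
    )

    class ListReader(CalibrationDataReader):
        """Serve the calibration feeds one by one, then None when exhausted."""

        def __init__(self, feeds):
            self._feeds = iter(feeds)

        def get_next(self):
            return next(self._feeds, None)

    quantize_static(
        model_fp32,
        model_int8,
        ListReader(calibration_feeds(samples, input_name)),
        activation_type=QuantType.QInt8,
        weight_type=QuantType.QInt8,
    )
    return model_int8
```

The input name of a given model can be read in Netron before running the calibration.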

Information: It is not possible to quantize a model directly on target.

3.3. Deploy the model on target

Once the model is in TensorFlow™ Lite or ONNX™ format and optimized for embedded deployment, the next step is to perform a benchmark on target using the X-LINUX-AI unified benchmark. This validates the correct behavior of the model. To do so, refer to the dedicated article: How to benchmark your NN model on STM32MPU

To go further and develop an AI application based on this model using the TensorFlow™ Lite runtime or ONNX™ Runtime, refer to the application example wiki articles: AI - Application examples