How to benchmark your NN model on STM32MPU

Applicable for

STM32MP13x lines, STM32MP15x lines, STM32MP25x lines

This article describes how to measure the performance of a Neural Network model on all STM32MPU platforms using the X-LINUX-AI unified benchmark.

1. Description

The X-LINUX-AI unified benchmark is a common benchmark application that can benchmark NBG (Network Binary Graph), TensorFlow Lite, and ONNX models with a single binary. The aim of this tool is to simplify NN model performance evaluation on STM32MPU platforms.

The model type (NBG, TFLite, or ONNX) is abstracted behind a common high-level API. In concrete terms, a single command can benchmark any supported model type. This makes it possible to benchmark a complete directory containing different types of models and compare them with each other.

The X-LINUX-AI unified benchmark provides several options and useful information, detailed below, to easily compare models and determine whether the model is correctly optimized to run on the current target.

2. Installation

2.1. Installing from the OpenSTLinux AI package repository

Warning

The software package is provided AS IS, and by downloading it, you agree to be bound to the terms of the software license agreement (SLA0048). The detailed content licenses can be found here.

After configuring the AI OpenSTLinux package, you can install the X-LINUX-AI components for this application.

The minimum package required is:

 apt-get install x-linux-ai-benchmark

3. How to use the X-LINUX-AI unified benchmark tool

3.1. Executing with the command line

The x-linux-ai-benchmark tool binary is located in the userfs partition: /usr/bin/x-linux-ai-benchmark

It can therefore be accessed from anywhere in the file system using the following command:

 x-linux-ai-benchmark

It accepts the following input parameters:

Usage: x-linux-ai-benchmark [-h] (-d MODELS_DIRECTORY | -m MODEL_PATH) [--cpu_cores CPU_CORES]
                            [--minimal_serial] [--export_json]

options:
  -h, --help            show this help message and exit
  -d MODELS_DIRECTORY, --models_directory MODELS_DIRECTORY
                        specify path to models directory that need to be tested without last /
  -m MODEL_PATH, --model_path MODEL_PATH     
                        specify path to model that need to be tested
  --cpu_cores CPU_CORES
                        number of CPU cores used for the benchmark, by default the benchmark automatically
                        detect the maximum of CPU cores available
  --minimal_serial      use this option to display result on a serial terminal
  --export_json         use this option to export result in json

The X-LINUX-AI unified benchmark is designed to be as simple as possible. Only one argument is mandatory to run the benchmark, and it must be chosen from the following two mutually exclusive arguments:

-m, --model_path: This option specifies the path to the NN model to be tested.
-d, --models_directory: This option benchmarks several models contained in the same directory. Model types can be mixed in the directory. The unified benchmark parses the files in the directory and skips every file that is not an NN model with a known extension.
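The directory scan described for -d can be pictured with a short Python sketch. It is not the tool's actual implementation: the function name find_models and the extension table are assumptions made for illustration (only .nb and .tflite appear in this article's examples, and .onnx is assumed for ONNX models).

```python
from pathlib import Path

# Assumed mapping from file extension to model type (hypothetical; the real
# tool may recognise a different set of extensions).
KNOWN_EXTENSIONS = {".nb": "nbg", ".tflite": "tflite", ".onnx": "onnx"}


def find_models(directory):
    """Return (path, model_type) pairs for files with a known extension.

    Any other entry is skipped with a console message, mirroring the
    behaviour described for the -d option.
    """
    models = []
    for entry in sorted(Path(directory).iterdir()):
        model_type = KNOWN_EXTENSIONS.get(entry.suffix.lower())
        if entry.is_file() and model_type is not None:
            models.append((entry, model_type))
        else:
            print(f"Skipping {entry.name}: not a recognised NN model")
    return models
```

Each returned pair could then be passed, one model at a time, to whatever routine runs the benchmark for that model type.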

The unified benchmark automatically selects the best available execution engine, depending on the board and the model type:

For STM32MP2 series boards, if the model is an NBG model, the benchmark runs on the NPU/GPU; otherwise, it runs on the CPU.
For STM32MP1 series boards, the benchmark always runs on the CPU.

In both cases, if the optional argument --cpu_cores is not set, the number of CPU cores used is automatically set to the maximum available. Otherwise, the benchmark uses the specified number of cores.
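The selection rules above can be summarized in a few lines of Python. This is a minimal sketch under stated assumptions, not the benchmark's source code: the function names, the "MP1"/"MP2" series labels, and the use of os.cpu_count() to detect the available cores are all hypothetical. Coral edgeTPU models are deliberately left out because the rules above do not cover them.

```python
import os


def select_execution_engine(board_series, model_type):
    """Pick the execution engine from the board series and model type.

    board_series: "MP1" or "MP2" (hypothetical labels).
    model_type: "nbg", "tflite" or "onnx".
    """
    # Only NBG models on STM32MP2 series boards run on the NPU/GPU.
    if board_series == "MP2" and model_type == "nbg":
        return "gpu/npu"
    # Every other combination falls back to the CPU.
    return "cpu"


def resolve_cpu_cores(requested=None):
    """Use the requested core count, or all available cores if unset."""
    if requested is not None:
        return requested
    # Assumption: os.cpu_count() stands in for the tool's own detection.
    return os.cpu_count() or 1
```

For example, select_execution_engine("MP2", "tflite") returns "cpu", which matches the rule that only NBG models use the NPU/GPU.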

The benchmark also provides two convenience options:

--export_json: This option exports the benchmark results to a JSON file named "x-linux-ai-benchmark-results.json". This file contains a JSON object named "board_information", which holds all the board configuration information, and one JSON object for each model tested.
--minimal_serial: The benchmark uses graphic libraries to format its output. Over a serial link, this formatting may not render correctly, so this option provides a lighter output format.
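When the benchmark is driven from a script, the arguments described above can be assembled programmatically. The helper below is a hypothetical sketch (build_benchmark_command is not part of X-LINUX-AI); it only uses the flags listed in the help text and enforces the rule that exactly one of -m and -d is given.

```python
def build_benchmark_command(model_path=None, models_directory=None,
                            cpu_cores=None, minimal_serial=False,
                            export_json=False):
    """Build the x-linux-ai-benchmark argument list (illustrative helper)."""
    # -m and -d are mutually exclusive, and one of them is mandatory.
    if (model_path is None) == (models_directory is None):
        raise ValueError("specify exactly one of model_path or models_directory")

    cmd = ["x-linux-ai-benchmark"]
    if model_path is not None:
        cmd += ["-m", str(model_path)]
    else:
        # The help text asks for the directory path without a trailing "/".
        cmd += ["-d", str(models_directory).rstrip("/") or "/"]
    if cpu_cores is not None:
        cmd += ["--cpu_cores", str(cpu_cores)]
    if minimal_serial:
        cmd.append("--minimal_serial")
    if export_json:
        cmd.append("--export_json")
    return cmd
```

On the target, the returned list can be passed to subprocess.run(cmd, check=True) to start the benchmark.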

The benchmark output consists of several tables, which depend on the type of model used.
The first table displays the characteristics of the board used for the benchmark.

Some of these characteristics are common to STM32MP1 series and STM32MP2 series boards: X-LINUX-AI version, board name, number of CPU cores available, and CPU frequency.
Additional characteristics are reported only for STM32MP2 series boards: GPU/NPU driver version and GPU/NPU frequency.

The second table summarizes the relevant information for the benchmarked models.

Inference time: Refers to the amount of time it takes for a machine learning model to process input data and produce an output prediction. It is reported in milliseconds.
CPU, GPU, NPU, CORAL_TPU %: Refers to the percentage of each execution engine used for the inference.
- For STM32MP2 series boards, all the execution engines are available.
- For STM32MP1 series boards, only the CPU and the Coral EdgeTPU are available, which is why "NA" is displayed for GPU and NPU.
Peak RAM: Refers to the maximum amount of RAM needed on the target to execute an inference of a specific NN model.

Moreover, on STM32MP2 series boards, an additional table may be displayed: the Non optimal model table. As its name suggests, it lists the models that are not correctly optimized for the MP2x target. If your model appears in this table, it is either not quantized or quantized with an unsupported quantization scheme such as per-channel. In that case, please refer to the article How to deploy your NN model on STM32MPU.

Here is an example of the Non optimal model table:

X-LINUX-AI unified benchmark non optimal table

4. How to benchmark a single model

4.1. On STM32MP2x board

For this demonstration, the NN model used is mobilenet_v3_large_100_224_quant.nb, a MobileNetV3 Large model that has been processed and converted to a network binary graph to run on the NPU. It is a lite model trained for image classification.

The model used in this example can be installed from the following package:

 apt-get install nbg-models-mobilenetv3

Information

The same demonstration can also be carried out with TFLite, ONNX, or edgeTPU models.

To launch the benchmark on a single model, use the following command:

  x-linux-ai-benchmark -m /usr/local/demo-ai/image-classification/models/mobilenet/mobilenet_v3_large_100_224_quant.nb

After running the benchmark, the console displays the following output:

X-LINUX-AI unified benchmark single model console output MP2x

The first table is dedicated to target information, and the second is dedicated to benchmark results.

4.2. On STM32MP1x board

For this demonstration, the NN model used is mobilenet_v1_0.5_128_quant.tflite, downloaded from TensorFlow Hub [1]. It is a lite model trained for image classification.

The model used in this example can be installed from the following package:

 apt-get install tflite-models-mobilenetv1

Information

The same demonstration can also be carried out with ONNX or edgeTPU models.

To launch the benchmark on a single model, use the following command:

  x-linux-ai-benchmark -m /usr/local/demo-ai/image-classification/models/mobilenet/mobilenet_v1_0.5_128_quant.tflite

After running the benchmark, the console displays the following output:

X-LINUX-AI unified benchmark single model console output MP1x

The first table is dedicated to target information, and the second is dedicated to benchmark results.

5. How to benchmark multiple models

With the X-LINUX-AI unified benchmark, it is possible to benchmark multiple models located in the same directory. With this method, you can easily compare the performance of multiple models with different architectures and model types.

5.1. On STM32MP2x board

For the demonstration, we will use image classification models. The benchmark will be run on NBG, TensorFlow Lite, ONNX and Coral edgeTPU models using all the compute engines available on the board.

The models used in this example can be installed from the following packages:

 apt-get install nbg-models-mobilenetv3
 apt-get install tflite-models-mobilenetv3
 apt-get install onnx-models-mobilenetv3

To launch the benchmark on multiple models stored in the same directory, use the following command:

  x-linux-ai-benchmark -d /usr/local/demo-ai/image-classification/models/mobilenet

After running the benchmark, the console displays the output.

Benchmark results for multiple models are grouped into separate tables by model type: one table each for NBG, TensorFlow Lite, Coral edgeTPU, and ONNX models. As mentioned earlier in this article, a "Non optimal model" table lists the models that are not quantized or that are quantized per-channel. For more information on these specific points, please refer to the article How to deploy your NN model on STM32MPU.

Information

If the benchmarked directory contains files that are not NN models, these files are skipped and a log message is printed in the console.

5.2. On STM32MP1x board

For the demonstration, we will use image classification models. The benchmark will be run on TensorFlow Lite, ONNX and Coral edgeTPU models.

The models used in this example can be installed from the following packages:

 apt-get install tflite-models-mobilenetv1
 apt-get install onnx-models-mobilenetv1

To launch the benchmark on multiple models stored in the same directory, use the following command:

  x-linux-ai-benchmark -d /usr/local/demo-ai/image-classification/models/mobilenet

After running the benchmark, the console displays the following output:

X-LINUX-AI unified benchmark multiple models console output MP1x

Benchmark results for multiple models are grouped into separate tables by model type: one table for TensorFlow Lite models, a second for Coral edgeTPU models, and a last one for ONNX models.

Information

If the benchmarked directory contains files that are not NN models, these files are skipped and a log message is printed in the console.

6. How to export benchmark results

Exporting benchmark results is simple: add the optional argument --export_json. At the end of the benchmark, a JSON file named x-linux-ai-benchmark-result.json is generated in the current directory, that is, the directory from which the benchmark was executed.

The JSON result file is built around different structures:

One dedicated to the board information:

    "board_information": {
        "name": "STM32MP257",
        "nb_cpu_core": 2,
        "cpu clock": 1500000000.0,
        "gpu version": "6.4.15.6.691815",
        "gpu clock": 800000000
    }

One structure per model tested:

    "mobilenet_v3_large_100_224_quant_nbg_dict": {
        "nn_name": "mobilenet_v3_large_100_224_quant",
        "model_type": "nbg",
        "execution_engine": "gpu/npu",
        "cpu_core_used": "2",
        "inference_time": 17.08,
        "cpu_usage": 0.0,
        "gpu_usage": 6.41,
        "gpu_layer_list": [],
        "npu_usage": 93.59,
        "npu_layer_list": [],
        "ram_usage": "NA",
        "macc_usage": "NA"
    }

If multiple models are tested, each model has a dedicated structure containing its benchmark results. The information listed in each structure may vary depending on the model type and the target used.
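Once exported, the results can be post-processed with a few lines of Python. The sketch below assumes that the exported file is a single JSON object whose top-level keys are "board_information" plus one key per model, as the fragments above suggest; the function names are hypothetical, and the path to the exported file is supplied by the caller.

```python
import json


def load_results(path):
    """Load the exported benchmark results from a JSON file."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def summarize_results(results):
    """Return (nn_name, model_type, inference_time) tuples, fastest first."""
    rows = []
    for key, entry in results.items():
        # Skip the board description; every other object is assumed to be a model.
        if key == "board_information" or not isinstance(entry, dict):
            continue
        rows.append((entry.get("nn_name", key),
                     entry.get("model_type"),
                     entry.get("inference_time")))

    def sort_key(row):
        time_ms = row[2]
        is_number = isinstance(time_ms, (int, float))
        # Entries without a numeric inference time (for example "NA") go last.
        return (not is_number, time_ms if is_number else 0)

    rows.sort(key=sort_key)
    return rows
```

Calling summarize_results(load_results(path)) on a multi-model export gives a list that can be printed or compared between runs.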

7. Going further

The X-LINUX-AI unified benchmark is built on top of the NBG, TensorFlow Lite, Coral, and ONNX benchmarks available in the X-LINUX-AI expansion package. To keep things simple, not all the options provided by those benchmark utilities are available in the unified benchmark.

To go further with a specific benchmark, please refer to the following articles:

For the NBG benchmark: How to measure the performance of NBG-based models
For the TFLite benchmark: How to measure performance of your NN models using TensorFlow Lite runtime
For the ONNX benchmark: How to measure performance of your models using ONNX Runtime
For the Coral EdgeTPU benchmark: How to measure performance of your NN models using the Coral Edge TPU

[1] TensorFlow Hub