How to benchmark your NN model on STM32MPU

Revision as of 18:09, 28 June 2024
Applicable for STM32MP13x lines, STM32MP15x lines, STM32MP25x lines


This article describes how to measure the performance of a Neural Network (NN) model on all STM32MPU platforms using the X-LINUX-AI unified benchmark.

1. Description

The X-LINUX-AI unified benchmark is a common benchmark application that can benchmark NBG (Network Binary Graph), TensorFlow™ Lite, and ONNX™ models with a single binary. The aim of this tool is to simplify NN model performance evaluation on STM32MPU platforms.

The model type (NBG, TFLite™, or ONNX™) is abstracted behind a high-level common API. In concrete terms, any supported model type can be benchmarked with the same command. This makes it possible to benchmark a complete directory containing different types of models and compare them.

The X-LINUX-AI unified benchmark provides several options and useful information, which are detailed below, to easily compare models and determine whether a model is correctly optimized to run on the current target.

2. Installation

2.1. Installing from the OpenSTLinux AI package repository

Warning
The software package is provided AS IS, and by downloading it, you agree to be bound to the terms of the software license agreement (SLA0048). The detailed content licenses can be found here.

After configuring the AI OpenSTLinux package, install the X-LINUX-AI components required for this application.

The minimum package required is:

 x-linux-ai -i x-linux-ai-benchmark

3. How to use the X-LINUX-AI unified benchmark tool

3.1. Executing with the command line

The x-linux-ai-benchmark binary is installed at /usr/bin/x-linux-ai-benchmark.

Because /usr/bin is on the default PATH, the tool can be run from any directory using the following command:

 x-linux-ai-benchmark

It accepts the following input parameters:

Usage: x-linux-ai-benchmark [-h] (-d MODELS_DIRECTORY | -m MODEL_PATH) [--cpu_cores CPU_CORES]
                            [--minimal_serial] [--export_json]

options:
  -h, --help            show this help message and exit
  -d MODELS_DIRECTORY, --models_directory MODELS_DIRECTORY
                        specify path to models directory that need to be tested without last /
  -m MODEL_PATH, --model_path MODEL_PATH     
                        specify path to model that need to be tested
  --cpu_cores CPU_CORES
                        number of CPU cores used for the benchmark, by default the benchmark automatically
                        detect the maximum of CPU cores available
  --minimal_serial      use this option to display result on a serial terminal
  --export_json         use this option to export result in json

The X-LINUX-AI unified benchmark is designed to be as simple as possible. Only one argument is mandatory to run the benchmark, and it must be one of the following two mutually exclusive options:

  • -m, --model_path: This option is used to specify the path to the NN model to be tested.
  • -d, --models_directory: This option is used to benchmark several models stored in the same directory. Model types can be mixed in the directory. The unified benchmark parses the files in the directory and skips every file that is not an NN model with a known extension.
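
The directory filtering described above can be sketched in Python as follows. This is an illustration only, not the tool's actual implementation: the function name `split_models` is hypothetical, and the extension list is taken from the "supported extension" message that the benchmark prints in its console log (.tflite, .onnx, .nb).

```python
from pathlib import Path

# Extensions reported as supported in the benchmark console log.
SUPPORTED_EXTENSIONS = {".tflite", ".onnx", ".nb"}


def split_models(file_names):
    """Split file names into (models, skipped) based on their extension.

    Hypothetical helper: it only mimics the documented behaviour of
    skipping files whose extension is not a known model type.
    """
    models, skipped = [], []
    for name in file_names:
        if Path(name).suffix in SUPPORTED_EXTENSIONS:
            models.append(name)
        else:
            skipped.append(name)
    return models, skipped


if __name__ == "__main__":
    names = ["mobilenet.tflite", "mobilenet.onnx", "mobilenet.nb",
             "labels.txt", "README"]
    models, skipped = split_models(names)
    print("models :", models)
    print("skipped:", skipped)
```

A file without any extension, such as `README`, has an empty suffix and therefore lands in the skipped list, which matches the "model extension :  not supported" line visible in the multi-model console output.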

Concerning the execution engine used to run the benchmark, the unified benchmark automatically selects the best possible solution, depending on the board and the model type used:

  • For STM32MP2 series boards, if the model used is an NBG, the benchmark runs on the NPU/GPU; otherwise, it runs on the CPU.
  • For STM32MP1 series boards, the benchmark always runs on the CPU.

In both cases, the number of CPU cores used is automatically set to the maximum available if the optional argument --cpu_cores is not set. Otherwise, the benchmark uses the specified number of cores.
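
The selection rules above can be summarized with a small Python sketch. It is a simplified model of the documented behaviour, not the tool's source code: the function names and the "MP1"/"MP2" labels are hypothetical, NBG models are recognized by the `.nb` extension used in this article's examples, and Coral Edge TPU handling is deliberately left out.

```python
import os


def select_engine(board_series, model_path):
    """Return the engine that the rules above associate with a board/model pair.

    board_series: "MP1" or "MP2" (hypothetical labels for this sketch).
    """
    is_nbg = model_path.endswith(".nb")
    if board_series == "MP2" and is_nbg:
        return "NPU/GPU"
    # Every other documented combination runs on the CPU.
    return "CPU"


def resolve_cpu_cores(requested=None):
    """Use the requested core count if given, else all detected cores."""
    if requested is not None:
        return requested
    return os.cpu_count() or 1


if __name__ == "__main__":
    print(select_engine("MP2", "yolov8n_256_quant_pt_uf_pose_coco-st.nb"))
    print(select_engine("MP2", "mobilenet_v1_0.5_128_quant.tflite"))
    print(select_engine("MP1", "mobilenet_v1_0.5_128_quant.tflite"))
    print(resolve_cpu_cores(1))
```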

The benchmark also provides two more convenient options:

  • --export_json: This option exports the benchmark results to a JSON file named "x-linux-ai-benchmark-results.json". This JSON file contains a JSON object named "board_information", which holds the board configuration information, and one JSON object for each model tested.
  • --minimal_serial: The benchmark uses some graphic libraries to format outputs. When using serial links, the formatting may not render correctly, so a lighter version is available with this option.
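
When the benchmark is driven from a script, the options above can be assembled programmatically. The sketch below is a hypothetical helper (`build_benchmark_command` is not part of X-LINUX-AI); it only uses the flags listed in the usage text and enforces that exactly one of -m or -d is given. It also strips the trailing "/" from the directory, as the help text requests.

```python
import shlex


def build_benchmark_command(model_path=None, models_directory=None,
                            cpu_cores=None, minimal_serial=False,
                            export_json=False):
    """Build the x-linux-ai-benchmark argument list from the documented options."""
    # -m and -d are mutually exclusive and exactly one of them is required.
    if (model_path is None) == (models_directory is None):
        raise ValueError("specify exactly one of model_path or models_directory")

    cmd = ["x-linux-ai-benchmark"]
    if model_path is not None:
        cmd += ["-m", model_path]
    else:
        # The help text asks for the directory path without a trailing "/".
        cmd += ["-d", models_directory.rstrip("/") or "/"]
    if cpu_cores is not None:
        cmd += ["--cpu_cores", str(cpu_cores)]
    if minimal_serial:
        cmd.append("--minimal_serial")
    if export_json:
        cmd.append("--export_json")
    return cmd


if __name__ == "__main__":
    cmd = build_benchmark_command(
        models_directory="/usr/local/x-linux-ai/image-classification/models/mobilenet/",
        cpu_cores=1,
        export_json=True,
    )
    # Only print the command here; on the target it could be passed to subprocess.run(cmd).
    print(shlex.join(cmd))
```

Returning a list rather than a single string avoids shell quoting issues if a model path contains spaces.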

The benchmark output is organized in tables, and the tables displayed depend on the types of models tested.
The first table displays the characteristics of the board used for the benchmark.

  • Some of these characteristics are common to STM32MP1 series and STM32MP2 series boards: the X-LINUX-AI version, the board name, the number of CPU cores available, and the CPU frequency.
  • More categories are available specifically for STM32MP2 series boards: GPU/NPU driver version and GPU/NPU frequency.

The second table summarizes the relevant information on the reference models.

  • Inference time refers to the amount of time it takes for a machine learning model to process input data and produce an output prediction. It is reported in milliseconds.
  • CPU, GPU, NPU, CORAL_TPU % refer to the share of the inference handled by each execution engine, as a percentage.
    • For STM32MP2 series boards, all the execution engines are available.
    • For STM32MP1 series boards, only the CPU and Coral Edge TPU™ are available, which is why "NA" is displayed for GPU and NPU.
  • Peak RAM refers to the maximum amount of RAM required on the target to run an inference of a given NN model.

On STM32MP2 series boards, a non-optimal model table may also be displayed. As its name suggests, it lists the models that are not correctly optimized for the STM32MP2x target. If your model appears in this table, it is either not quantized or quantized with an unsupported quantization scheme such as per-channel. In that case, refer to the article How to deploy your NN model on STM32MPU.

The example below contains a non-optimal model table:

+--------------------------------------------------------------------------------------------+
|                                    NBG models benchmark                                    |
+------------------------------+---------------------+-------+-------+-------+---------------+
|          Model Name          | Inference Time (ms) | CPU % | GPU % | NPU % | Peak RAM (MB) |
+------------------------------+---------------------+-------+-------+-------+---------------+
| movenet_singlepose_lightning |        65.23        |  0.0  | 93.76 |  6.24 |       NA      |
+------------------------------+---------------------+-------+-------+-------+---------------+
+--------------------------------------------------------------------------------+
|                               Non-Optimal models                               |
+------------------------------+-------------------------------------------------+
|          model name          |                     comments                    |
+------------------------------+-------------------------------------------------+
| movenet_singlepose_lightning | GPU usage is 93.76% compared to NPU usage 6.24% |
|                              | please verify if the model is quantized or that |
|                              | the quantization scheme used is the 8-bits per- |
|                              |                      tensor                     |
+------------------------------+-------------------------------------------------+
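
The exact criterion used by the benchmark to flag a model is not stated in this article. As a rough illustration, the sketch below assumes a model is flagged when the GPU handles a larger share of the inference than the NPU, which is consistent with the comment shown in the table above. Both the rule and the function name `looks_non_optimal` are assumptions.

```python
def looks_non_optimal(gpu_usage, npu_usage):
    """Heuristic (assumption): flag an NBG model when the GPU share exceeds the NPU share."""
    return gpu_usage > npu_usage


if __name__ == "__main__":
    # GPU % and NPU % values copied from the NBG tables in this article.
    samples = {
        "movenet_singlepose_lightning": (93.76, 6.24),
        "yolov8n-pose": (13.69, 86.31),
        "mobilenet_v2_1.0_224_int8_per_tensor": (6.98, 93.02),
    }
    for name, (gpu, npu) in samples.items():
        status = "non-optimal" if looks_non_optimal(gpu, npu) else "ok"
        print(f"{name}: {status}")
```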

4. How to benchmark a single model

4.1. On STM32MP2x board

For this demonstration, the NN model used is yolov8n_256_quant_pt_uf_pose_coco-st.nb, a YOLOv8n model that has been processed and converted to a network binary graph (NBG) to run on the NPU.

The model used in this example can be installed from the following package:

 x-linux-ai -i pose-estimation-models-yolov8n

Information
The same demonstration can also be carried out with TFLite™, ONNX™, or Edge TPU™ models.

Use the following command to launch the benchmark on a single model:

 x-linux-ai-benchmark -m /usr/local/x-linux-ai/pose-estimation/models/yolov8n_pose/yolov8n_256_quant_pt_uf_pose_coco-st.nb

After running the benchmark, this is the output on the console:

+------------------------------------------------+
|     X-LINUX-AI unified NN model benchmark      |
+----------------------------+-------------------+
|          Machine           |  STM32MP257F-EV1  |
|         CPU cores          |         2         |
|    CPU Clock frequency     |       1.5GHz      |
|  GPU/NPU Driver Version    |  6.4.15.6.691815  |
|  GPU/NPU Clock frequency   |      800 MHZ      |
|    X-LINUX-AI Version      |       v5.1.0      |
+----------------------------+-------------------+
For NBG models, computation engine use for benchmark : NPU running at 800 MHZ
For TFLite and ONNX models, computation engine use for benchmark : CPU with 2 cores at :  1.5GHz
+----------------------------------------------------------------------------+
|                            NBG models benchmark                            |
+--------------+---------------------+-------+-------+-------+---------------+
|  Model Name  | Inference Time (ms) | CPU % | GPU % | NPU % | Peak RAM (MB) |
+--------------+---------------------+-------+-------+-------+---------------+
| yolov8n-pose |        15.46        |  0.0  | 13.69 | 86.31 |       NA      |
+--------------+---------------------+-------+-------+-------+---------------+

The first table is dedicated to target information, and the second is dedicated to benchmark results.

4.2. On STM32MP1x board

For this demonstration, the NN model used is mobilenet_v1_0.5_128_quant.tflite, downloaded from TensorFlow Hub[1]. It is a lite model trained for image classification.

The model used in this example can be installed from the following package:

 x-linux-ai -i img-models-mobilenetv1-05-128

Information
The same demonstration can also be carried out with ONNX™ or Edge TPU™ models.

Use the following command to launch the benchmark on a single model:

  x-linux-ai-benchmark -m /usr/local/x-linux-ai/image-classification/models/mobilenet/mobilenet_v1_0.5_128_quant.tflite

After running the benchmark, this is the output on the console:

+------------------------------------------------+
|     X-LINUX-AI unified NN model benchmark      |
+--------------------------+---------------------+
|         Machine          |   STM32MP157F-DK2   |
|        CPU cores         |          2          |
|   CPU Clock frequency    |        0.8GHz       |
|   X-LINUX-AI Version     |        v5.1.0       |
+--------------------------+---------------------+
Computation engine use for benchmark : CPU with 2 cores at :  0.8GHz
+--------------------------------------------------------------------------+
|                     TensorFlow Lite models benchmark                     |
+----------------------------+---------------------+-------+---------------+
|         Model Name         | Inference Time (ms) | CPU % | Peak RAM (MB) |
+----------------------------+---------------------+-------+---------------+
| mobilenet_v1_0.5_128_quant |        28.31        | 100.0 |     27.37     |
+----------------------------+---------------------+-------+---------------+

The first table is dedicated to target information, and the second is dedicated to benchmark results.

5. How to benchmark multiple models

The X-LINUX-AI unified benchmark can benchmark multiple models located in the same directory. This makes it easy to compare the performance of models with different architectures and model types.

5.1. On STM32MP2x board

For this demonstration, we use image classification models. The benchmark runs on NBG, TensorFlow™ Lite, ONNX™, and Coral Edge TPU™ models using all the compute engines available on the board.

The models used in this example can be installed from the following package:

 x-linux-ai -i img-models-mobilenetv2-10-224

Use the following command to launch the benchmark of multiple models stored in the same directory:

  x-linux-ai-benchmark -d /usr/local/x-linux-ai/image-classification/models/mobilenet/

After running the benchmark, this is the output on the console:

+------------------------------------------------+
|     X-LINUX-AI unified NN model benchmark      |
+----------------------------+-------------------+
|          Machine           |  STM32MP257F-EV1  |
|         CPU cores          |         2         |
|    CPU Clock frequency     |       1.5GHz      |
|  GPU/NPU Driver Version    |  6.4.15.6.691815  |
|  GPU/NPU Clock frequency   |      800 MHZ      |
|    X-LINUX-AI Version      |       v5.1.0      |
+----------------------------+-------------------+
For NBG models, computation engine use for benchmark : NPU running at 800 MHZ
For TFLite and ONNX models, computation engine use for benchmark : CPU with 2 cores at :  1.5GHz
model extension : .txt not supported, model skipped => supported extension are : .tflite, .onnx, .nb 
Coral edgetpu not connected skip the model
model extension :  not supported, model skipped => supported extension are : .tflite, .onnx, .nb 
+----------------------------------------------------------------------------------------------------+
|                                        NBG models benchmark                                        |
+--------------------------------------+---------------------+-------+-------+-------+---------------+
|              Model Name              | Inference Time (ms) | CPU % | GPU % | NPU % | Peak RAM (MB) |
+--------------------------------------+---------------------+-------+-------+-------+---------------+
| mobilenet_v2_1.0_224_int8_per_tensor |        11.75        |  0.0  |  6.98 | 93.02 |       NA      |
+--------------------------------------+---------------------+-------+-------+-------+---------------+
+------------------------------------------------------------------------------------+
|                          TensorFlow Lite models benchmark                          |
+--------------------------------------+---------------------+-------+---------------+
|              Model Name              | Inference Time (ms) | CPU % | Peak RAM (MB) |
+--------------------------------------+---------------------+-------+---------------+
| mobilenet_v2_1.0_224_int8_per_tensor |        119.6        | 100.0 |     39.15     |
+--------------------------------------+---------------------+-------+---------------+
+------------------------------------------------------------------------------------+
|                               ONNX models benchmark                                |
+--------------------------------------+---------------------+-------+---------------+
|              Model Name              | Inference Time (ms) | CPU % | Peak RAM (MB) |
+--------------------------------------+---------------------+-------+---------------+
| mobilenet_v2_1.0_224_int8_per_tensor |        178.57       | 100.0 |      44.0     |
+--------------------------------------+---------------------+-------+---------------+

Benchmark results for multiple models are grouped into separate tables, depending on the model type. One table is dedicated to each of the NBG, TensorFlow™ Lite, Coral Edge TPU™, and ONNX™ model types. As mentioned earlier in this article, a "non-optimal model" table is also displayed when a model is not quantized or is quantized per-channel. For further information on these specific points, refer to the article How to deploy your NN model on STM32MPU.
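
To make the comparison concrete, the inference times reported in the tables above can be turned into speedup ratios. The short Python sketch below uses the three values from this example run; the helper name `speedup_vs` is hypothetical and the figures only describe this particular run on the STM32MP257F-EV1.

```python
# Inference times (ms) copied from the STM32MP257F-EV1 tables above.
inference_ms = {
    "nbg": 11.75,
    "tflite": 119.6,
    "onnx": 178.57,
}


def speedup_vs(baseline, times):
    """Return how many times faster `baseline` is than each other entry."""
    base = times[baseline]
    return {name: t / base for name, t in times.items() if name != baseline}


if __name__ == "__main__":
    for name, ratio in speedup_vs("nbg", inference_ms).items():
        print(f"NBG is about {ratio:.1f}x faster than {name}")
```

For this run, the NBG version executed on the NPU is roughly 10 times faster than the TensorFlow Lite version and roughly 15 times faster than the ONNX version, both of which run on the CPU.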

Information
Files that are present in the benchmarked directory but are not NN models are skipped, and a message is logged in the console.

5.2. On STM32MP1x board

For this demonstration, we use image classification models. The benchmark runs on TensorFlow™ Lite, ONNX™, and Coral Edge TPU™ models.

The models used in this example can be installed from the following package:

 x-linux-ai -i img-models-mobilenetv1-05-128

Use the following command to launch the benchmark of multiple models stored in the same directory:

  x-linux-ai-benchmark -d /usr/local/x-linux-ai/image-classification/models/mobilenet/

After running the benchmark, this is the output on the console:

[Figure: X-LINUX-AI unified benchmark console output for multiple models on STM32MP1x]

Benchmark results for multiple models are grouped into separate tables, depending on the model type. One table is dedicated to TensorFlow™ Lite models, a second to Coral Edge TPU™ models, and the last one to ONNX™ models.

Information
Files that are present in the benchmarked directory but are not NN models are skipped, and a message is logged in the console.

6. How to export benchmark results

To export the benchmark results, use the optional argument --export_json. At the end of the benchmark, a JSON file named x-linux-ai-benchmark-result.json is generated in the directory from which the benchmark was executed.

The JSON result file is built around different structures:

  • One dedicated to the board information:
    "board_information": {
        "name": "STM32MP257",
        "nb_cpu_core": 2,
        "cpu clock": 1500000000.0,
        "gpu version": "6.4.15.6.691815",
        "gpu clock": 800000000
    },
  • One structure per model tested:
    "mobilenet_v2_1.0_224_int8_per_tensor_nbg": {
        "nn_name": "mobilenet_v2_1.0_224_int8_per_tensor",
        "model_type": "nbg",
        "execution_engine": "gpu/npu",
        "cpu_core_used": "2",
        "inference_time": 11.74,
        "cpu_usage": 0.0,
        "gpu_usage": 6.81,
        "gpu_layer_list": [
            "DepthwiseConvLayer",
            "Softmax2Layer"
        ],
        "npu_usage": 93.19,
        "npu_layer_list": [
            "TensorTranspose",
            "ConvolutionReluPoolingLayer2",
            "FullyConnectedReluLayer",
            "TensorCopy"
        ],
        "ram_usage": "NA",
        "macc_usage": "NA"
    },
    "mobilenet_v2_1.0_224_int8_per_tensor_onnx": {
        "nn_name": "mobilenet_v2_1.0_224_int8_per_tensor",
        "model_type": "onnx",
        "execution_engine": "cpu",
        "cpu_core_used": "2",
        "inference_time": 177.94,
        "cpu_usage": 100.0,
        "gpu_usage": "NA",
        "gpu_layer_list": [
            "NA"
        ],
        "npu_usage": "NA",
        "npu_layer_list": [
            "NA"
        ],
        "ram_usage": "44228608",
        "macc_usage": "NA"
    },
    "mobilenet_v2_1.0_224_int8_per_tensor_tflite": {
        "nn_name": "mobilenet_v2_1.0_224_int8_per_tensor",
        "model_type": "tflite",
        "execution_engine": "cpu",
        "cpu_core_used": "2",
        "inference_time": 119.77,
        "cpu_usage": 100.0,
        "gpu_usage": "NA",
        "gpu_layer_list": [
            "NA"
        ],
        "npu_usage": "NA",
        "npu_layer_list": [
            "NA"
        ],
        "ram_usage": 37902300,
        "macc_usage": "NA"
    }

If multiple models are tested, each model has a dedicated structure containing its benchmark results. The information listed in each structure may vary depending on the model type and the target used.
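
The exported file can be post-processed with a few lines of Python. The sketch below is only an example under stated assumptions: it assumes the structures shown above are wrapped in a single top-level JSON object, it works on a trimmed inline sample instead of a real export, and the helper names `to_number` and `summarize` are hypothetical. Note that fields such as ram_usage may appear as a number, a numeric string, or "NA", so they are normalized before use.

```python
import json

# Trimmed sample mirroring the structures shown above; the enclosing braces
# are an assumption about how the fragments are wrapped in the real file.
SAMPLE = """
{
    "board_information": {"name": "STM32MP257", "nb_cpu_core": 2},
    "mobilenet_v2_1.0_224_int8_per_tensor_nbg": {
        "model_type": "nbg", "inference_time": 11.74, "ram_usage": "NA"
    },
    "mobilenet_v2_1.0_224_int8_per_tensor_onnx": {
        "model_type": "onnx", "inference_time": 177.94, "ram_usage": "44228608"
    },
    "mobilenet_v2_1.0_224_int8_per_tensor_tflite": {
        "model_type": "tflite", "inference_time": 119.77, "ram_usage": 37902300
    }
}
"""


def to_number(value):
    """Convert a field that may be a number, a numeric string, or "NA"."""
    try:
        return float(value)
    except (TypeError, ValueError):
        return None  # "NA" or any other non-numeric value


def summarize(results):
    """Return (model_type, inference_time, ram_usage) rows, fastest model first."""
    rows = []
    for key, entry in results.items():
        if key == "board_information":
            continue  # board description, not a model entry
        rows.append((entry["model_type"],
                     to_number(entry["inference_time"]),
                     to_number(entry["ram_usage"])))
    # Entries without a numeric inference time are placed last.
    return sorted(rows, key=lambda r: float("inf") if r[1] is None else r[1])


if __name__ == "__main__":
    # On the target, json.loads(SAMPLE) would be replaced by loading the exported file.
    for model_type, time_ms, ram in summarize(json.loads(SAMPLE)):
        print(model_type, time_ms, ram)
```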

7. Going further

The X-LINUX-AI unified benchmark is built on top of the NBG, TensorFlow™ Lite, Coral, and ONNX™ benchmarks available in the X-LINUX-AI expansion package. To keep things simple, not all the options provided by those benchmark utilities are available in the unified benchmark.

To go further on a specific benchmark, refer to the following articles: