How to measure the performance of NBG-based models

This article describes how to measure the performance of a network binary graph (NBG), generated from your neural network (NN) model with ST Edge AI Core, using the NBG benchmark tool on STM32MP2 series boards.

Information

If you encounter any difficulties generating the network binary graph from your neural network model, refer to ST_Edge_AI:_Guide_for_MPU.

1. Description

The NBG benchmark tool allows you to run performance measurements on a network binary graph once it has been generated from your NN model. The main benchmark metrics are:

Inference time: the time it takes for a trained deep learning model to make predictions or decisions based on new input data.
MAC utilization: the percentage of the NPU's MAC (multiply-accumulate) units used by the network binary graph. This metric indicates how well the model exploits the computing capacity of the NPU hardware accelerator.
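To make the link between the two metrics concrete, the following minimal sketch computes a MAC utilization percentage from the model's theoretical MMAC count, the measured inference time and the NPU clock frequency. It assumes utilization is defined as achieved MAC/s divided by peak MAC/s, with peak MAC/s equal to the number of NPU MAC units times the clock frequency. The exact formula used by nbg_benchmark is not given in this article, and `mac_units` is a hypothetical parameter that you must replace with the value for your NPU.

```python
def mac_utilization(case_mmac: float, inference_ms: float,
                    npu_freq_hz: float, mac_units: int) -> float:
    """Return MAC utilization in percent (assumed definition, not the tool's code).

    case_mmac:    theoretical MACs of the model, in millions (the -c value)
    inference_ms: measured average inference time, in milliseconds
    npu_freq_hz:  NPU clock frequency, in hertz
    mac_units:    number of MAC units on the NPU (hypothetical, NPU-specific)
    """
    if inference_ms <= 0 or npu_freq_hz <= 0 or mac_units <= 0:
        raise ValueError("inference time, frequency and MAC units must be positive")
    # MACs actually performed per second during one inference
    achieved_macs_per_s = case_mmac * 1e6 / (inference_ms / 1000.0)
    # MACs the NPU could perform per second if every unit worked every cycle
    peak_macs_per_s = mac_units * npu_freq_hz
    return 100.0 * achieved_macs_per_s / peak_macs_per_s


# Hypothetical numbers, for illustration only
print(round(mac_utilization(case_mmac=500, inference_ms=10,
                            npu_freq_hz=1e9, mac_units=1000), 2))  # 5.0
```

With these illustrative values, the model performs 5e10 MAC/s against a hypothetical peak of 1e12 MAC/s, i.e. 5% utilization. A low value suggests that the inference time is dominated by something other than raw compute.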

2. Installation

2.1. Installing from the OpenSTLinux AI package repository

Warning

The software package is provided AS IS, and by downloading it, you agree to be bound by the terms of the software license agreement (SLA0048). The detailed content licenses can be found here.

After configuring the AI OpenSTLinux package repository, install the X-LINUX-AI component for this application:

 x-linux-ai -i nbg-benchmark

3. How to use the NBG benchmark tool

3.1. Executing with the command line

The nbg_benchmark tool binary is located in the userfs partition:

/usr/local/bin/nbg-benchmark-*/tools/nbg_benchmark

It accepts the following input parameters:

Usage: /usr/local/bin/nbg-benchmark-*/tools/nbg_benchmark  -m <nbg_file .nb> -i <input_file .tensor/.txt> -c <int case_mmac> -l <int nb_loops>

-m --nb_file <.nb file path>:               .nb network binary file to be benchmarked.
-i --input_file <.tensor/.txt/ file path>:  Input file to be used for benchmark (maximum 32 input files).
-c --case_mmac <int>:                        Theorical value of MMAC (Million Multiply Accumulate) of the model.
-l --loops <int>:                           The number of loops of the inference (default loops=1)
--help:                                     Show this help

Only the .nb file parameter (-m) is mandatory to run the benchmark. Optionally, you can pass the .tensor file generated by ST Edge AI Core as the benchmark input (-i).
If you know the theoretical MMAC count (case MAC) of your model, pass it with -c. If it is not set, the MAC utilization computation is skipped and only the inference time is reported.
Finally, use -l to run the inference multiple times. The default is 1 loop.
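The rules above (-m is mandatory, -i and -c are optional, -l defaults to 1) can be wrapped in a small script. The following is a minimal Python sketch and is not part of X-LINUX-AI. The helper names `find_nbg_benchmark`, `build_nbg_benchmark_cmd` and `run_nbg_benchmark` are hypothetical. The sketch passes a single input file only, because the syntax for passing several input files is not described here, and it assumes the tool prints its results to standard output.

```python
import glob
import subprocess
from typing import List, Optional


def find_nbg_benchmark() -> str:
    """Locate nbg_benchmark under its versioned directory (e.g. nbg-benchmark-5.1.0)."""
    matches = sorted(glob.glob("/usr/local/bin/nbg-benchmark-*/tools/nbg_benchmark"))
    if not matches:
        raise FileNotFoundError("nbg_benchmark not found: install the nbg-benchmark package")
    # If several versions are installed, take the last one in lexicographic order.
    return matches[-1]


def build_nbg_benchmark_cmd(binary: str, nb_file: str,
                            input_file: Optional[str] = None,
                            case_mmac: Optional[int] = None,
                            loops: int = 1) -> List[str]:
    """Build an nbg_benchmark argument list (hypothetical helper)."""
    if not nb_file.endswith(".nb"):
        raise ValueError("nb_file must be a .nb network binary graph")
    if loops < 1:
        raise ValueError("loops must be at least 1")
    cmd = [binary, "-m", nb_file]          # mandatory network binary graph
    if input_file is not None:
        cmd += ["-i", input_file]          # optional .tensor/.txt input
    if case_mmac is not None:
        cmd += ["-c", str(case_mmac)]      # enables the MAC utilization computation
    cmd += ["-l", str(loops)]              # number of inference loops
    return cmd


def run_nbg_benchmark(cmd: List[str]) -> str:
    """Run the benchmark and return its standard output (assumed to hold the results)."""
    result = subprocess.run(cmd, check=True, capture_output=True, text=True)
    return result.stdout
```

On the target, a call such as `run_nbg_benchmark(build_nbg_benchmark_cmd(find_nbg_benchmark(), "<path to model>.nb", loops=10))` would run ten inference loops without MAC utilization. Keeping command construction separate from execution makes the argument handling easy to check without the board.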

3.2. Testing with NBG based on YoloV8n

The model used in this example is yolov8n_256_quant_pt_uf_pose_coco-st.nb, a YoloV8n pose estimation model that has been processed and converted to a network binary graph to run on the NPU.
It can be installed from the following package:

 x-linux-ai -i pose-estimation-models-yolov8n

On the target, the model is located here:

/usr/local/x-linux-ai/pose-estimation/models/yolov8n_pose/

To launch the benchmark tool, use the following command:

   /usr/local/bin/nbg-benchmark-*/tools/nbg_benchmark -m /usr/local/x-linux-ai/pose-estimation/models/yolov8n_pose/yolov8n_256_quant_pt_uf_pose_coco-st.nb

Console output:

/usr/local/bin/nbg-benchmark-5.1.0/tools/nbg_benchmark -m  /usr/local/x-linux-ai/pose-estimation/models/yolov8n_pose/yolov8n_256_quant_pt_uf_pose_coco-st.nb 
Info: Network binary file set to: /usr/local/x-linux-ai/pose-estimation/models/yolov8n_pose/yolov8n_256_quant_pt_uf_pose_coco-st.nb
Info: Verifying graph...
Info: Verifying graph took: 8ms or 8246us
Info: Copied a buffer of a size of 196608 to tensor.
Info: NPU running at frequency: 800020090Hz.
Info: Started running the graph [1] loops ...
Info: No Case MAC has been specified for this model.
Info: The MAC Utilization cannot be computed.
Info: Loop:1,Average: 17.81 ms or 17812.68 us
Info: Peak working set size: 20434944 bytes
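When running many benchmarks, it can help to extract the key figures from the console output automatically. The minimal Python sketch below parses the lines shown above with regular expressions. The patterns are derived only from this sample output, so they are an assumption and may need adjusting for other tool versions or for the lines printed when MAC utilization is computed. The function name `parse_nbg_benchmark_output` is hypothetical.

```python
import re
from typing import Dict, Optional, Union

Number = Union[int, float]

# Patterns derived from the sample nbg_benchmark console output (assumption).
_PATTERNS = {
    "npu_freq_hz": (re.compile(r"NPU running at frequency:\s*(\d+)\s*Hz"), int),
    "avg_inference_ms": (re.compile(r"Loop:\d+,\s*Average:\s*([\d.]+)\s*ms"), float),
    "peak_working_set_bytes": (re.compile(r"Peak working set size:\s*(\d+)\s*bytes"), int),
}


def parse_nbg_benchmark_output(text: str) -> Dict[str, Optional[Number]]:
    """Extract key metrics from nbg_benchmark console output (hypothetical helper)."""
    results: Dict[str, Optional[Number]] = {}
    for key, (pattern, convert) in _PATTERNS.items():
        matches = pattern.findall(text)
        # Use the last occurrence in case a metric is printed more than once.
        results[key] = convert(matches[-1]) if matches else None
    avg_ms = results["avg_inference_ms"]
    # Throughput derived from the average inference time, in inferences per second.
    results["inferences_per_s"] = 1000.0 / avg_ms if avg_ms else None
    return results
```

For the YoloV8n output above, this yields an average inference time of 17.81 ms, which corresponds to roughly 56 inferences per second, and a peak working set of 20434944 bytes.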