This article describes how to measure the performance of a TensorFlow Lite neural network model on STM32MPU platforms.
1. Installation
1.1. Installing from the OpenSTLinux AI package repository
After having configured the AI OpenSTLinux package repository, install the X-LINUX-AI components required for this application. The minimum package required is:
x-linux-ai -i tensorflow-lite-tools
The model used in this example can be installed from the following package:
x-linux-ai -i img-models-mobilenetv2-10-224
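As a quick check (a minimal sketch; the paths are the ones used later in this article and may differ depending on the X-LINUX-AI version installed), you can verify that both the benchmark tool and the model are present on the target:

# Check that the benchmark tool is installed
ls /usr/local/bin/tensorflow-lite-*/tools/benchmark_model
# Check that the MobileNet V2 image classification model is installed
ls /usr/local/x-linux-ai/image-classification/models/mobilenet/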
2. How to use the Benchmark application
2.1. Executing with the command line
The benchmark_model C/C++ application is located in the userfs partition:
/usr/local/bin/tensorflow-lite-*/tools/benchmark_model
It accepts the following input parameters:
usage: ./benchmark_model <flags>
Flags:
  --num_runs=50                            int32   optional  expected number of runs, see also min_secs, max_secs
  --min_secs=1                             float   optional  minimum number of seconds to rerun for, potentially s
  --max_secs=150                           float   optional  maximum number of seconds to rerun for, potentially .
  --run_delay=-1                           float   optional  delay between runs in seconds
  --run_frequency=-1                       float   optional  Execute at a fixed frequency, instead of a fixed del.
  --num_threads=-1                         int32   optional  number of threads
  --use_caching=false                      bool    optional  Enable caching of prepacked weights matrices in matr.
  --benchmark_name=                        string  optional  benchmark name
  --output_prefix=                         string  optional  benchmark output prefix
  --warmup_runs=1                          int32   optional  minimum number of runs performed on initialization, s
  --warmup_min_secs=0.5                    float   optional  minimum number of seconds to rerun for, potentially s
  --verbose=false                          bool    optional  Whether to log parameters whose values are not set. .
  --dry_run=false                          bool    optional  Whether to run the tool just with simply loading the.
  --report_peak_memory_footprint=false     bool    optional  Report the peak memory footprint by periodically che.
  --memory_footprint_check_interval_ms=50  int32   optional  The interval in millisecond between two consecutive .
  --graph=                                 string  optional  graph file name
  --input_layer=                           string  optional  input layer names
  --input_layer_shape=                     string  optional  input layer shape
  --input_layer_value_range=               string  optional  A map-like string representing value range for *inte4
  --input_layer_value_files=               string  optional  A map-like string representing value file. Each item.
  --allow_fp16=false                       bool    optional  allow fp16
  --require_full_delegation=false          bool    optional  require delegate to run the entire graph
  --enable_op_profiling=false              bool    optional  enable op profiling
  --max_profiling_buffer_entries=1024      int32   optional  max profiling buffer entries
  --profiling_output_csv_file=             string  optional  File path to export profile data as CSV, if not set .
  --print_preinvoke_state=false            bool    optional  print out the interpreter internals just before call.
  --print_postinvoke_state=false           bool    optional  print out the interpreter internals just before benc.
  --release_dynamic_tensors=false          bool    optional  Ensure dynamic tensor's memory is released when they.
  --help=false                             bool    optional  Print out all supported flags if true.
  --num_threads=-1                         int32   optional  number of threads used for inference on CPU.
  --max_delegated_partitions=0             int32   optional  Max number of partitions to be delegated.
  --min_nodes_per_partition=0              int32   optional  The minimal number of TFLite graph nodes of a partit.
  --delegate_serialize_dir=                string  optional  Directory to be used by delegates for serializing an.
  --delegate_serialize_token=              string  optional  Model-specific token acting as a namespace for deleg.
  --external_delegate_path=                string  optional  The library path for the underlying external.
  --external_delegate_options=             string  optional  A list of comma-separated options to be passed to th.
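As an illustration of how these flags combine (a sketch only: the flag values below are examples, and the model path is the one used in the next section), a typical invocation could look like this:

# Run at least 100 measured inferences on 2 CPU threads with per-operator profiling enabled
/usr/local/bin/tensorflow-lite-*/tools/benchmark_model \
    --graph=/usr/local/x-linux-ai/image-classification/models/mobilenet/mobilenet_v2_1.0_224_int8_per_tensor.tflite \
    --num_runs=100 \
    --num_threads=2 \
    --enable_op_profiling=true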
2.2. Testing with MobileNet
The model used for testing is mobilenet_v2_1.0_224_int8_per_tensor.tflite, downloaded from the STM32 AI model zoo[1].
It is an image classification model.
On the target, the model is located here:
/usr/local/x-linux-ai/image-classification/models/mobilenet/
Several types of delegation can be used with this benchmark to improve performance.
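Before running an accelerated benchmark, it can be useful to check that the VX delegate library used in the next sections is present on the target (a minimal sketch; the library path is the one used in the commands below):

# Check that the VX external delegate library is available
ls -l /usr/lib/libvx_delegate.so*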
2.2.1. Benchmark on NPU
This part shows how to use the benchmark with NPU acceleration.
To use the NPU acceleration, an option must be added to the benchmark so that it delegates the execution of the neural network; in our case, the operations are delegated to the VX delegate. The option to use is --external_delegate_path=/usr/lib/libvx_delegate.so.2, which gives the following command:
/usr/local/bin/tensorflow-lite-*/tools/benchmark_model --graph=/usr/local/x-linux-ai/image-classification/models/mobilenet/mobilenet_v2_1.0_224_int8_per_tensor.tflite --num_threads=2 --external_delegate_path=/usr/lib/libvx_delegate.so.2
Information: When using the NPU, there is a warm-up time that can sometimes be quite long depending on the model used.
Console output:
INFO: STARTING!
INFO: Log parameter values verbosely: [0]
INFO: Num threads: [2]
INFO: Graph: [/usr/local/x-linux-ai/image-classification/models/mobilenet/mobilenet_v2_1.0_224_int8_per_tensor.tflite]
INFO: #threads used for CPU inference: [2]
INFO: External delegate path: [/usr/lib/libvx_delegate.so.2]
INFO: Loaded model /usr/local/x-linux-ai/image-classification/models/mobilenet/mobilenet_v2_1.0_224_int8_per_tensor.tflite
INFO: Vx delegate: allowed_cache_mode set to 0.
INFO: Vx delegate: device num set to 0.
INFO: Vx delegate: allowed_builtin_code set to 0.
INFO: Vx delegate: error_during_init set to 0.
INFO: Vx delegate: error_during_prepare set to 0.
INFO: Vx delegate: error_during_invoke set to 0.
INFO: EXTERNAL delegate created.
INFO: Explicitly applied EXTERNAL delegate, and the model graph will be completely executed by the delegate.
INFO: The input model file size (MB): 3.59541
INFO: Initialized session in 348.762ms.
INFO: Running benchmark for at least 1 iterations and at least 0.5 seconds but terminate if exceeding 150 seconds.
W [HandleLayoutInfer:332]Op 162: default layout inference pass.
W [HandleLayoutInfer:332]Op 162: default layout inference pass.
W [HandleLayoutInfer:332]Op 162: default layout inference pass.
W [HandleLayoutInfer:332]Op 162: default layout inference pass.
W [HandleLayoutInfer:332]Op 162: default layout inference pass.
W [HandleLayoutInfer:332]Op 162: default layout inference pass.
W [HandleLayoutInfer:332]Op 162: default layout inference pass.
W [HandleLayoutInfer:332]Op 162: default layout inference pass.
W [HandleLayoutInfer:332]Op 162: default layout inference pass.
W [HandleLayoutInfer:332]Op 162: default layout inference pass.
W [HandleLayoutInfer:332]Op 162: default layout inference pass.
W [HandleLayoutInfer:332]Op 162: default layout inference pass.
W [HandleLayoutInfer:332]Op 162: default layout inference pass.
W [HandleLayoutInfer:332]Op 162: default layout inference pass.
W [HandleLayoutInfer:332]Op 162: default layout inference pass.
W [HandleLayoutInfer:332]Op 162: default layout inference pass.
W [HandleLayoutInfer:332]Op 162: default layout inference pass.
INFO: count=1 curr=21792293
INFO: Running benchmark for at least 50 iterations and at least 1 seconds but terminate if exceeding 150 seconds.
INFO: count=78 first=12990 curr=12820 min=12632 max=13711 avg=12758.7 std=133
INFO: Inference timings in us: Init: 348762, First inference: 21792293, Warmup (avg): 2.17923e+07, Inference (avg): 12758.7
INFO: Note: as the benchmark tool itself affects memory footprint, the following is only APPROXIMATE to the actual memory footprint of the model at runtime. Take the information at your discretion.
INFO: Memory footprint delta from the start of the tool (MB): init=10.625 overall=115.172
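If you want to make sure the whole graph is offloaded to the NPU and have the benchmark fail otherwise, the --require_full_delegation flag from the listing above can be added. This is a sketch reusing the NPU command, not a mandatory step:

/usr/local/bin/tensorflow-lite-*/tools/benchmark_model \
    --graph=/usr/local/x-linux-ai/image-classification/models/mobilenet/mobilenet_v2_1.0_224_int8_per_tensor.tflite \
    --num_threads=2 \
    --external_delegate_path=/usr/lib/libvx_delegate.so.2 \
    --require_full_delegation=true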
2.2.2. Benchmark on CPU
The easiest way to use the benchmark is to run it on the CPU.
To do this, run the benchmark with at least the --graph option. To go a little further, it can be interesting to set the number of threads to the number of CPU cores to improve performance. Here is the command to execute:
/usr/local/bin/tensorflow-lite-*/tools/benchmark_model --graph=/usr/local/x-linux-ai/image-classification/models/mobilenet/mobilenet_v2_1.0_224_int8_per_tensor.tflite --num_threads=2
Console output:
STARTING!
Log parameter values verbosely: [0]
Num threads: [2]
Graph: [/usr/local/x-linux-ai/image-classification/models/mobilenet/mobilenet_v2_1.0_224_int8_per_tensor.tflite]
#threads used for CPU inference: [2]
Loaded model /usr/local/x-linux-ai/image-classification/models/mobilenet/mobilenet_v2_1.0_224_int8_per_tensor.tflite
INFO: Created TensorFlow Lite XNNPACK delegate for CPU.
The input model file size (MB): 3.59541
Initialized session in 273.952ms.
Running benchmark for at least 1 iterations and at least 0.5 seconds but terminate if exceeding 150 seconds.
count=5 first=133187 curr=119112 min=119112 max=133187 avg=122056 std=5566
Running benchmark for at least 50 iterations and at least 1 seconds but terminate if exceeding 150 seconds.
count=50 first=120156 curr=119232 min=119081 max=128264 avg=119760 std=1422
Inference timings in us: Init: 273952, First inference: 133187, Warmup (avg): 122056, Inference (avg): 119760
Note: as the benchmark tool itself affects memory footprint, the following is only APPROXIMATE to the actual memory footprint of the model at runtime. Take the information at your discretion.
Memory footprint delta from the start of the tool (MB): init=13.4102 overall=19.6641
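To evaluate how the CPU inference time scales with the number of threads, a simple shell loop can be used. This is a minimal sketch; the thread counts are examples and should be adapted to the number of CPU cores of your STM32MPU platform:

# Compare CPU inference time with 1 and 2 threads
for threads in 1 2; do
    /usr/local/bin/tensorflow-lite-*/tools/benchmark_model \
        --graph=/usr/local/x-linux-ai/image-classification/models/mobilenet/mobilenet_v2_1.0_224_int8_per_tensor.tflite \
        --num_threads=$threads
done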
2.2.3. Benchmark on GPU
This part shows how to use the benchmark with GPU acceleration.
The procedure is similar to the NPU one; however, it is necessary to export an environment variable to force the use of the GPU only. First, export the following environment variable:
export VIV_VX_DISABLE_TP_NN=1
Then, run the command:
/usr/local/bin/tensorflow-lite-*/tools/benchmark_model --graph=/usr/local/x-linux-ai/image-classification/models/mobilenet/mobilenet_v2_1.0_224_int8_per_tensor.tflite --num_threads=2 --external_delegate_path=/usr/lib/libvx_delegate.so.2
Console output:
INFO: STARTING!
INFO: Log parameter values verbosely: [0]
INFO: Num threads: [2]
INFO: Graph: [/usr/local/x-linux-ai/image-classification/models/mobilenet/mobilenet_v2_1.0_224_int8_per_tensor.tflite]
INFO: #threads used for CPU inference: [2]
INFO: External delegate path: [/usr/lib/libvx_delegate.so.2]
INFO: Loaded model /usr/local/x-linux-ai/image-classification/models/mobilenet/mobilenet_v2_1.0_224_int8_per_tensor.tflite
INFO: Vx delegate: allowed_cache_mode set to 0.
INFO: Vx delegate: device num set to 0.
INFO: Vx delegate: allowed_builtin_code set to 0.
INFO: Vx delegate: error_during_init set to 0.
INFO: Vx delegate: error_during_prepare set to 0.
INFO: Vx delegate: error_during_invoke set to 0.
INFO: EXTERNAL delegate created.
INFO: Explicitly applied EXTERNAL delegate, and the model graph will be completely executed by the delegate.
INFO: The input model file size (MB): 3.59541
INFO: Initialized session in 31.554ms.
INFO: Running benchmark for at least 1 iterations and at least 0.5 seconds but terminate if exceeding 150 seconds.
W [HandleLayoutInfer:332]Op 162: default layout inference pass.
W [HandleLayoutInfer:332]Op 162: default layout inference pass.
W [HandleLayoutInfer:332]Op 162: default layout inference pass.
W [HandleLayoutInfer:332]Op 162: default layout inference pass.
W [HandleLayoutInfer:332]Op 162: default layout inference pass.
W [HandleLayoutInfer:332]Op 162: default layout inference pass.
W [HandleLayoutInfer:332]Op 162: default layout inference pass.
W [HandleLayoutInfer:332]Op 162: default layout inference pass.
W [HandleLayoutInfer:332]Op 162: default layout inference pass.
W [HandleLayoutInfer:332]Op 162: default layout inference pass.
W [HandleLayoutInfer:332]Op 162: default layout inference pass.
W [HandleLayoutInfer:332]Op 162: default layout inference pass.
W [HandleLayoutInfer:332]Op 162: default layout inference pass.
W [HandleLayoutInfer:332]Op 162: default layout inference pass.
W [HandleLayoutInfer:332]Op 162: default layout inference pass.
W [HandleLayoutInfer:332]Op 162: default layout inference pass.
W [HandleLayoutInfer:332]Op 162: default layout inference pass.
INFO: count=1 curr=2296912
INFO: Running benchmark for at least 50 iterations and at least 1 seconds but terminate if exceeding 150 seconds.
INFO: count=50 first=72759 curr=72012 min=71791 max=72759 avg=72064.3 std=152
INFO: Inference timings in us: Init: 31554, First inference: 2296912, Warmup (avg): 2.29691e+06, Inference (avg): 72064.3
INFO: Note: as the benchmark tool itself affects memory footprint, the following is only APPROXIMATE to the actual memory footprint of the model at runtime. Take the information at your discretion.
INFO: Memory footprint delta from the start of the tool (MB): init=10.75 overall=143.371
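Once the GPU measurement is done, the environment variable can be unset so that the VX delegate is no longer restricted to the GPU for subsequent runs (a minimal sketch):

# Remove the GPU-only restriction for subsequent benchmark runs
unset VIV_VX_DISABLE_TP_NN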
3. References