
How to measure the performance of your models using ONNX Runtime

Applicable for STM32MP13x lines, STM32MP15x lines, STM32MP25x lines


This article describes how to measure the performance of an ONNX model using ONNX Runtime on STM32MPU platforms.

1. Installation

1.1. Installing from the OpenSTLinux AI package repository

Warning
The software package is provided AS IS, and by downloading it, you agree to be bound to the terms of the software license agreement (SLA0048). The detailed content licenses can be found here.

After having configured the OpenSTLinux AI package repository, install the X-LINUX-AI components needed for this application. The minimum required package is:

 x-linux-ai -i onnxruntime-tools

The model used in this example can be installed from the following package:

 x-linux-ai -i img-models-mobilenetv2-10-224
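
Optionally, you can check that the model file is present at its expected location (this is the path used later in this article):

 ls -l /usr/local/x-linux-ai/image-classification/models/mobilenet/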

2. How to use the benchmark application

2.1. Executing with the command line

The onnxruntime_perf_test executable is located in the userfs partition:

/usr/local/bin/onnxruntime-*/tools/onnxruntime_perf_test

It accepts the following input parameters:

usage: ./onnxruntime_perf_test [options...] model_path [result_file]
Options:
        -m [test_mode]: Specifies the test mode. Value could be 'duration' or 'times'.
                Provide 'duration' to run the test for a fix duration, and 'times' to repeated for a certain times. 
        -M: Disable memory pattern.
        -A: Disable memory arena
        -I: Generate tensor input binding (Free dimensions are treated as 1.)
        -c [parallel runs]: Specifies the (max) number of runs to invoke simultaneously. Default:1..
        -r [repeated_times]: Specifies the repeated times if running in 'times' test mode.Default:1000.
        -t [seconds_to_run]: Specifies the seconds to run for 'duration' mode. Default:600.
        -p [profile_file]: Specifies the profile name to enable profiling and dump the profile data to the file.
        -s: Show statistics result, like P75, P90. If no result_file provided this defaults to on.
        -v: Show verbose information.
        -x [intra_op_num_threads]: Sets the number of threads used to parallelize the execution within nodes, A value of 0 means ORT will pick a default. Must >=0.
        -y [inter_op_num_threads]: Sets the number of threads used to parallelize the execution of the graph (across nodes), A value of 0 means ORT will pick a default. Must >=0.
        -f [free_dimension_override]: Specifies a free dimension by name to override to a specific value for performance optimization. Syntax is [dimension_name:override_value]. override_value must > 0
        -F [free_dimension_override]: Specifies a free dimension by denotation to override to a specific value for performance optimization. Syntax is [dimension_denotation:override_value]. override_value must > 0
        -P: Use parallel executor instead of sequential executor.
        -o [optimization level]: Default is 99 (all). Valid values are 0 (disable), 1 (basic), 2 (extended), 99 (all).
                Please see onnxruntime_c_api.h (enum GraphOptimizationLevel) for the full list of all optimization levels.
        -u [optimized_model_path]: Specify the optimized model path for saving.
        -z: Set denormal as zero. When turning on this option reduces latency dramatically, a model may have denormals.

        -h: help
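
As an illustrative example (not part of the original procedure), these options can be combined to run the benchmark for a fixed duration instead of a fixed number of runs, and to dump profiling data for later analysis. The command below uses the MobileNet model installed above; the ort_profile name passed to -p is an arbitrary choice used as the base name of the generated JSON profiling file:

 /usr/local/bin/onnxruntime-*/tools/onnxruntime_perf_test -I -m duration -t 30 -p ort_profile /usr/local/x-linux-ai/image-classification/models/mobilenet/mobilenet_v2_1.0_224_int8_per_tensor.onnx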

2.2. Testing with MobileNet

The model used for testing is mobilenet_v2_1.0_224_int8_per_tensor.onnx, an image classification model installed by the img-models-mobilenetv2-10-224 package.
On the target, the model is located here:

/usr/local/x-linux-ai/image-classification/models/mobilenet/

To benchmark an ONNX model with onnxruntime_perf_test, use the following command:

 /usr/local/bin/onnxruntime-*/tools/onnxruntime_perf_test -I -m times -r 8 /usr/local/x-linux-ai/image-classification/models/mobilenet/mobilenet_v2_1.0_224_int8_per_tensor.onnx

Console output:

Session creation time cost: 0.632683 s
Total inference time cost: 1.41846 s
Total inference requests: 8
Average inference time cost: 177.308 ms
Total inference run time: 1.4186 s
Number of inferences per second: 5.63936 
Avg CPU usage: 98 %
Peak working set size: 52129792 bytes
Avg CPU usage:98
Peak working set size:52129792
Runs:8
Min Latency: 0.17603 s
Max Latency: 0.182825 s
P50 Latency: 0.176624 s
P90 Latency: 0.182825 s
P95 Latency: 0.182825 s
P99 Latency: 0.182825 s
P999 Latency: 0.182825 s

To obtain the best performance, depending on the hardware used, add the flags -P -x 2 -y 1 to run the benchmark with more than one thread.

 /usr/local/bin/onnxruntime-*/tools/onnxruntime_perf_test -I -m times -r 8 -P -x 2 -y 1 /usr/local/x-linux-ai/image-classification/models/mobilenet/mobilenet_v2_1.0_224_int8_per_tensor.onnx

Console output:

Setting intra_op_num_threads to 2
Setting inter_op_num_threads to 1
Session creation time cost: 0.483545 s
Total inference time cost: 1.43479 s
Total inference requests: 8
Average inference time cost: 179.349 ms
Total inference run time: 1.43495 s
Number of inferences per second: 5.57509 
Avg CPU usage: 96 %
Peak working set size: 43249664 bytes
Avg CPU usage:96
Peak working set size:43249664
Runs:8
Min Latency: 0.177462 s
Max Latency: 0.181616 s
P50 Latency: 0.178953 s
P90 Latency: 0.181616 s
P95 Latency: 0.181616 s
P99 Latency: 0.181616 s
P999 Latency: 0.181616 s

To display more information, use the flag -v.
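
For example, the previous benchmark command can be rerun with verbose output enabled:

 /usr/local/bin/onnxruntime-*/tools/onnxruntime_perf_test -v -I -m times -r 8 -P -x 2 -y 1 /usr/local/x-linux-ai/image-classification/models/mobilenet/mobilenet_v2_1.0_224_int8_per_tensor.onnx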

3. References