This article describes how to measure the performance of an ONNX model using ONNX Runtime on the STM32MP1x platform.
1. Installation[edit source]
1.1. Installing from the OpenSTLinux AI package repository[edit source]
After configuring the AI OpenSTLinux package repository, install the X-LINUX-AI components for this application. The minimum package required is:
apt-get install onnxruntime-tools
2. How to use the Benchmark application[edit source]
2.1. Executing with the command line[edit source]
The onnxruntime_perf_test executable is located in the userfs partition:
/usr/local/bin/onnxruntime-x.x.x/tools/onnxruntime_perf_test
It accepts the following input parameters:
usage: ./onnxruntime_perf_test [options...] model_path [result_file]
Options:
        -m [test_mode]: Specifies the test mode. Value could be 'duration' or 'times'. Provide 'duration' to run the test for a fixed duration, and 'times' to repeat a certain number of times.
        -M: Disable memory pattern.
        -A: Disable memory arena.
        -I: Generate tensor input binding (free dimensions are treated as 1).
        -c [parallel runs]: Specifies the (max) number of runs to invoke simultaneously. Default: 1.
        -r [repeated_times]: Specifies the number of repetitions when running in 'times' test mode. Default: 1000.
        -t [seconds_to_run]: Specifies the number of seconds to run in 'duration' mode. Default: 600.
        -p [profile_file]: Specifies the profile name to enable profiling and dump the profile data to the file.
        -s: Show statistics results, like P75, P90. If no result_file is provided, this defaults to on.
        -v: Show verbose information.
        -x [intra_op_num_threads]: Sets the number of threads used to parallelize the execution within nodes. A value of 0 means ORT picks a default. Must be >= 0.
        -y [inter_op_num_threads]: Sets the number of threads used to parallelize the execution of the graph (across nodes). A value of 0 means ORT picks a default. Must be >= 0.
        -f [free_dimension_override]: Specifies a free dimension by name to override to a specific value for performance optimization. Syntax is [dimension_name:override_value]. override_value must be > 0.
        -F [free_dimension_override]: Specifies a free dimension by denotation to override to a specific value for performance optimization. Syntax is [dimension_denotation:override_value]. override_value must be > 0.
        -P: Use the parallel executor instead of the sequential executor.
        -o [optimization level]: Default is 99 (all). Valid values are 0 (disable), 1 (basic), 2 (extended), 99 (all). See onnxruntime_c_api.h (enum GraphOptimizationLevel) for the full list of optimization levels.
        -u [optimized_model_path]: Specifies the path for saving the optimized model.
        -z: Set denormals as zero. When turning on this option reduces latency dramatically, the model may have denormals.
        -h: help
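The percentile figures reported with -s (P50, P90, and so on) summarize the distribution of per-run latencies. As a cross-check, they can be reproduced from a list of run timings. A minimal sketch in Python, assuming a nearest-rank percentile method (the exact interpolation used by onnxruntime_perf_test is an implementation detail and may differ):

```python
import math

def percentile(latencies, p):
    """Return the p-th percentile of a list of latencies (nearest-rank method)."""
    ordered = sorted(latencies)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]

# Hypothetical per-run latencies (seconds) for an 8-run benchmark:
runs = [0.812, 0.837, 0.840, 0.843, 0.825, 0.831, 0.839, 0.836]
print(f"P50: {percentile(runs, 50):.3f} s")
print(f"P90: {percentile(runs, 90):.3f} s")
```

With only 8 runs, the high percentiles (P90 and above) all collapse onto the slowest one or two runs, which is why P90 through P999 are often identical in short benchmarks.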
2.2. Testing with MobileNet V2[edit source]
The model used for testing is MobileNet v2-1.0-int8, an image-classification model downloaded from the onnx/models[1] repository.
On the target, the model is located here:
/usr/local/demo-ai/computer-vision/models/onnx/mobilenet/
To benchmark an ONNX model with onnxruntime_perf_test, use the following command:
/usr/local/bin/onnxruntime-1.11.0/tools/onnxruntime_perf_test -I -m times -r 8 /usr/local/demo-ai/computer-vision/models/onnx/mobilenet/mobilenetv2-12-int8.onnx
Console output:
Session creation time cost: 0.321219 s
Total inference time cost: 6.65975 s
Total inference requests: 8
Average inference time cost: 832.469 ms
Total inference run time: 6.65992 s
Avg CPU usage: 49 %
Peak working set size: 31289344 bytes
Avg CPU usage:49
Peak working set size:31289344
Runs:8
Min Latency: 0.81186 s
Max Latency: 0.843032 s
P50 Latency: 0.837183 s
P90 Latency: 0.843032 s
P95 Latency: 0.843032 s
P99 Latency: 0.843032 s
P999 Latency: 0.843032 s
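The reported average can be verified from the totals: the average inference time is the total inference time divided by the number of requests. A quick check in Python, using the figures from the output above:

```python
# Verify: average inference time = total inference time / number of requests.
total_s = 6.65975   # "Total inference time cost" from the output above
requests = 8        # "Total inference requests"

average_ms = total_s / requests * 1000
print(f"Average inference time: {average_ms:.3f} ms")  # matches the reported 832.469 ms
```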
To obtain the best performance, it is worth adding the flags -P -x 2 -y 2 so that the benchmark uses more than one thread, depending on the hardware used.
/usr/local/bin/onnxruntime-1.11.0/tools/onnxruntime_perf_test -I -m times -r 8 -P -x 2 -y 2 /usr/local/demo-ai/computer-vision/models/onnx/mobilenet/mobilenetv2-12-int8.onnx
Console output:
Setting intra_op_num_threads to 2
Setting inter_op_num_threads to 2
Session creation time cost: 0.380196 s
Total inference time cost: 3.64146 s
Total inference requests: 8
Average inference time cost: 455.182 ms
Total inference run time: 3.64163 s
Avg CPU usage: 96 %
Peak working set size: 33357824 bytes
Avg CPU usage:96
Peak working set size:33357824
Runs:8
Min Latency: 0.434487 s
Max Latency: 0.479169 s
P50 Latency: 0.455549 s
P90 Latency: 0.479169 s
P95 Latency: 0.479169 s
P99 Latency: 0.479169 s
P999 Latency: 0.479169 s
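The two-thread run roughly halves the latency while nearly saturating both cores. The speedup and parallel efficiency can be quantified from the two reported averages (832.469 ms single-threaded versus 455.182 ms with two threads):

```python
# Compare the single-thread and two-thread runs reported above.
single_ms = 832.469   # average inference time, default settings
dual_ms = 455.182     # average inference time with -P -x 2 -y 2

speedup = single_ms / dual_ms
efficiency = speedup / 2  # two threads in use
print(f"Speedup: {speedup:.2f}x, parallel efficiency: {efficiency:.0%}")
```

An efficiency noticeably below 100% is expected: not every operator in the graph parallelizes, and thread synchronization adds overhead.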
To display more information, use the -v flag.
/usr/local/bin/onnxruntime-1.11.0/tools/onnxruntime_perf_test -I -m times -r 8 -P -x 2 -y 2 -v /usr/local/demo-ai/computer-vision/models/onnx/mobilenet/mobilenetv2-12-int8.onnx
Console output (excerpt):
Setting intra_op_num_threads to 2
Setting inter_op_num_threads to 2
2022-08-08 12:38:04.156272731 [I:onnxruntime:, inference_session.cc:324 operator()] Flush-to-zero and denormal-as-zero are off
2022-08-08 12:38:04.156620149 [I:onnxruntime:, inference_session.cc:331 ConstructorCommon] Creating and using per session threadpools since use_per_session_threads_ is true
2022-08-08 12:38:04.156790358 [I:onnxruntime:, inference_session.cc:351 ConstructorCommon] Dynamic block base set to 0
2022-08-08 12:38:04.245297680 [I:onnxruntime:, inference_session.cc:1327 Initialize] Initializing session.
2022-08-08 12:38:04.255709583 [I:onnxruntime:, inference_session.cc:1364 Initialize] Adding default CPU execution provider.
2022-08-08 12:38:04.328995239 [I:onnxruntime:, reshape_fusion.cc:42 ApplyImpl] Total fused reshape node count: 0
2022-08-08 12:38:04.339136808 [I:onnxruntime:, reshape_fusion.cc:42 ApplyImpl] Total fused reshape node count: 0
...
2022-08-08 12:38:04.461602805 [V:onnxruntime:, inference_session.cc:150 VerifyEachNodeIsAssignedToAnEp] Node placements
2022-08-08 12:38:04.461805847 [V:onnxruntime:, inference_session.cc:152 VerifyEachNodeIsAssignedToAnEp] All nodes have been placed on [CPUExecutionProvider].
2022-08-08 12:38:04.465085231 [V:onnxruntime:, session_state.cc:68 CreateGraphInfo] SaveMLValueNameIndexMapping
2022-08-08 12:38:04.467233028 [V:onnxruntime:, session_state.cc:114 CreateGraphInfo] Done saving OrtValue mappings.
2022-08-08 12:38:04.472646876 [I:onnxruntime:, session_state_utils.cc:140 SaveInitializedTensors] Saving initialized tensors.
2022-08-08 12:38:04.500036908 [I:onnxruntime:, session_state_utils.cc:266 SaveInitializedTensors] Done saving initialized tensors
2022-08-08 12:38:04.567663966 [I:onnxruntime:, inference_session.cc:1576 Initialize] Session successfully initialized.
2022-08-08 12:38:04.569381262 [I:onnxruntime:, parallel_executor.cc:110 RunNodeAsync] Begin execution
2022-08-08 12:38:05.044396709 [I:onnxruntime:, parallel_executor.cc:110 RunNodeAsync] Begin execution
2022-08-08 12:38:05.049980599 [I:onnxruntime:, parallel_executor.cc:110 RunNodeAsync] Begin execution
2022-08-08 12:38:05.481864556 [I:onnxruntime:, parallel_executor.cc:110 RunNodeAsync] Begin execution
iteration:1,time_cost:0.437829
2022-08-08 12:38:05.488209698 [I:onnxruntime:, parallel_executor.cc:110 RunNodeAsync] Begin execution
2022-08-08 12:38:05.918691567 [I:onnxruntime:, parallel_executor.cc:110 RunNodeAsync] Begin execution
iteration:2,time_cost:0.436364
2022-08-08 12:38:05.924688166 [I:onnxruntime:, parallel_executor.cc:110 RunNodeAsync] Begin execution
2022-08-08 12:38:06.395984895 [I:onnxruntime:, parallel_executor.cc:110 RunNodeAsync] Begin execution
iteration:3,time_cost:0.476823
2022-08-08 12:38:06.401708703 [I:onnxruntime:, parallel_executor.cc:110 RunNodeAsync] Begin execution
2022-08-08 12:38:06.841845140 [I:onnxruntime:, parallel_executor.cc:110 RunNodeAsync] Begin execution
iteration:4,time_cost:0.44602
2022-08-08 12:38:06.847975989 [I:onnxruntime:, parallel_executor.cc:110 RunNodeAsync] Begin execution
2022-08-08 12:38:07.278073899 [I:onnxruntime:, parallel_executor.cc:110 RunNodeAsync] Begin execution
iteration:5,time_cost:0.435996
2022-08-08 12:38:07.284063790 [I:onnxruntime:, parallel_executor.cc:110 RunNodeAsync] Begin execution
2022-08-08 12:38:07.723888810 [I:onnxruntime:, parallel_executor.cc:110 RunNodeAsync] Begin execution
iteration:6,time_cost:0.445308
2022-08-08 12:38:07.729596784 [I:onnxruntime:, parallel_executor.cc:110 RunNodeAsync] Begin execution
2022-08-08 12:38:08.183826009 [I:onnxruntime:, parallel_executor.cc:110 RunNodeAsync] Begin execution
iteration:7,time_cost:0.45994
2022-08-08 12:38:08.190072401 [I:onnxruntime:, parallel_executor.cc:110 RunNodeAsync] Begin execution
2022-08-08 12:38:08.621355814 [I:onnxruntime:, parallel_executor.cc:110 RunNodeAsync] Begin execution
iteration:8,time_cost:0.438163
Session creation time cost: 0.413285 s
Total inference time cost: 3.57644 s
Total inference requests: 8
Average inference time cost: 447.055 ms
Total inference run time: 3.57796 s
Avg CPU usage: 97 %
Peak working set size: 33345536 bytes
Avg CPU usage:97
Peak working set size:33345536
Runs:8
Min Latency: 0.435996 s
Max Latency: 0.476823 s
P50 Latency: 0.445308 s
P90 Latency: 0.476823 s
P95 Latency: 0.476823 s
P99 Latency: 0.476823 s
P999 Latency: 0.476823 s
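The verbose output interleaves "Begin execution" log lines with "iteration:N,time_cost:X" markers, so the per-run latencies can be recovered from a saved console log with a simple regular expression. A minimal sketch, using a few of the iteration lines from the output above as sample input:

```python
import re

# Sample lines copied from the verbose (-v) console output above.
log_excerpt = """
iteration:1,time_cost:0.437829
iteration:2,time_cost:0.436364
iteration:3,time_cost:0.476823
"""

# Recover the per-run timings (seconds) from the "time_cost" markers.
times = [float(m) for m in re.findall(r"time_cost:([0-9.]+)", log_excerpt)]
print(f"runs: {len(times)}, min: {min(times)} s, max: {max(times)} s")
```

This is handy for post-processing a long benchmark log, for example to plot the latency of each run or to spot warm-up effects in the first iterations.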
3. References[edit source]