This article describes how to measure the performance of an ONNX model using ONNX Runtime on the STM32MP1x platform.
1. Installation[edit source]
1.1. Installing from the OpenSTLinux AI package repository[edit source]
After configuring the AI OpenSTLinux package repository, install the X-LINUX-AI components for this application. The minimum package required is:
apt-get install onnxruntime-tools
2. How to use the Benchmark application[edit source]
2.1. Executing with the command line[edit source]
The onnxruntime_perf_test executable is located in the userfs partition:
/usr/local/bin/onnxruntime-x.x.x/tools/onnxruntime_perf_test
It accepts the following input parameters:
usage: ./onnxruntime_perf_test [options...] model_path [result_file]
Options:
        -m [test_mode]: Specifies the test mode. Value could be 'duration' or 'times'. Provide 'duration' to run the test for a fixed duration, and 'times' to repeat a certain number of times.
        -M: Disable memory pattern.
        -A: Disable memory arena.
        -I: Generate tensor input binding (free dimensions are treated as 1).
        -c [parallel runs]: Specifies the (max) number of runs to invoke simultaneously. Default: 1.
        -r [repeated_times]: Specifies the number of repetitions when running in 'times' test mode. Default: 1000.
        -t [seconds_to_run]: Specifies the number of seconds to run in 'duration' mode. Default: 600.
        -p [profile_file]: Specifies the profile name to enable profiling and dump the profile data to the file.
        -s: Show statistics results, like P75, P90. If no result_file is provided, this defaults to on.
        -v: Show verbose information.
        -x [intra_op_num_threads]: Sets the number of threads used to parallelize the execution within nodes. A value of 0 means ORT picks a default. Must be >= 0.
        -y [inter_op_num_threads]: Sets the number of threads used to parallelize the execution of the graph (across nodes). A value of 0 means ORT picks a default. Must be >= 0.
        -f [free_dimension_override]: Specifies a free dimension by name to override to a specific value for performance optimization. Syntax is [dimension_name:override_value]. override_value must be > 0.
        -F [free_dimension_override]: Specifies a free dimension by denotation to override to a specific value for performance optimization. Syntax is [dimension_denotation:override_value]. override_value must be > 0.
        -P: Use the parallel executor instead of the sequential executor.
        -o [optimization level]: Default is 99 (all). Valid values are 0 (disable), 1 (basic), 2 (extended), 99 (all). See onnxruntime_c_api.h (enum GraphOptimizationLevel) for the full list of optimization levels.
        -u [optimized_model_path]: Specifies the path for saving the optimized model.
        -z: Set denormals as zero. When turning on this option reduces latency dramatically, the model may have denormals.
        -h: help
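The percentile figures reported with -s (P50, P90, and so on) summarize the distribution of per-run latencies. As a cross-check, they can be reproduced from a list of run timings. A minimal sketch in Python, assuming a nearest-rank percentile method (the exact interpolation used by onnxruntime_perf_test is an implementation detail and may differ):

```python
import math

def percentile(latencies, p):
    """Return the p-th percentile of a list of latencies (nearest-rank method)."""
    ordered = sorted(latencies)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]

# Hypothetical per-run latencies (seconds) for an 8-run benchmark:
runs = [0.812, 0.837, 0.840, 0.843, 0.825, 0.831, 0.839, 0.836]
print(f"P50: {percentile(runs, 50):.3f} s")
print(f"P90: {percentile(runs, 90):.3f} s")
```

With only 8 runs, the high percentiles (P90 and above) all collapse onto the slowest one or two runs, which is why P90 through P999 are often identical in short benchmarks.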
2.2. Testing with MobileNet V2[edit source]
The model used for testing is MobileNet v2-1.0-int8, an image-classification model downloaded from the onnx/models[1] repository.
On the target, the model is located here:
/usr/local/demo-ai/computer-vision/models/onnx/mobilenet/
To benchmark an ONNX model with onnxruntime_perf_test, use the following command:
/usr/local/bin/onnxruntime-1.11.0/tools/onnxruntime_perf_test -I -m times -r 8 /usr/local/demo-ai/computer-vision/models/onnx/mobilenet/mobilenetv2-12-int8.onnx
Console output:
Session creation time cost: 0.321219 s
Total inference time cost: 6.65975 s
Total inference requests: 8
Average inference time cost: 832.469 ms
Total inference run time: 6.65992 s
Avg CPU usage: 49 %
Peak working set size: 31289344 bytes
Avg CPU usage:49
Peak working set size:31289344
Runs:8
Min Latency: 0.81186 s
Max Latency: 0.843032 s
P50 Latency: 0.837183 s
P90 Latency: 0.843032 s
P95 Latency: 0.843032 s
P99 Latency: 0.843032 s
P999 Latency: 0.843032 s
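The reported average can be verified from the totals: the average inference time is the total inference time divided by the number of requests. A quick check in Python, using the figures from the output above:

```python
# Verify: average inference time = total inference time / number of requests.
total_s = 6.65975   # "Total inference time cost" from the output above
requests = 8        # "Total inference requests"

average_ms = total_s / requests * 1000
print(f"Average inference time: {average_ms:.3f} ms")  # matches the reported 832.469 ms
```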
To obtain the best performance, it is worth adding the flags -P -x 2 -y 2 so that the benchmark uses more than one thread, depending on the hardware used.
/usr/local/bin/onnxruntime-1.11.0/tools/onnxruntime_perf_test -I -m times -r 8 -P -x 2 -y 2 /usr/local/demo-ai/computer-vision/models/onnx/mobilenet/mobilenetv2-12-int8.onnx
Console output:
Setting intra_op_num_threads to 2
Setting inter_op_num_threads to 2
Session creation time cost: 0.380196 s
Total inference time cost: 3.64146 s
Total inference requests: 8
Average inference time cost: 455.182 ms
Total inference run time: 3.64163 s
Avg CPU usage: 96 %
Peak working set size: 33357824 bytes
Avg CPU usage:96
Peak working set size:33357824
Runs:8
Min Latency: 0.434487 s
Max Latency: 0.479169 s
P50 Latency: 0.455549 s
P90 Latency: 0.479169 s
P95 Latency: 0.479169 s
P99 Latency: 0.479169 s
P999 Latency: 0.479169 s
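The two-thread run roughly halves the latency while nearly saturating both cores. The speedup and parallel efficiency can be quantified from the two reported averages (832.469 ms single-threaded versus 455.182 ms with two threads):

```python
# Compare the single-thread and two-thread runs reported above.
single_ms = 832.469   # average inference time, default settings
dual_ms = 455.182     # average inference time with -P -x 2 -y 2

speedup = single_ms / dual_ms
efficiency = speedup / 2  # two threads in use
print(f"Speedup: {speedup:.2f}x, parallel efficiency: {efficiency:.0%}")
```

An efficiency noticeably below 100% is expected: not every operator in the graph parallelizes, and thread synchronization adds overhead.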
To display more information, use the -v flag.
/usr/local/bin/onnxruntime-1.11.0/tools/onnxruntime_perf_test -I -m times -r 8 -P -x 2 -y 2 -v /usr/local/demo-ai/computer-vision/models/onnx/mobilenet/mobilenetv2-12-int8.onnx
Console output (excerpt):
Setting intra_op_num_threads to 2
Setting inter_op_num_threads to 2
2022-08-08 12:38:04.156272731 [I:onnxruntime:, inference_session.cc:324 operator()] Flush-to-zero and denormal-as-zero are off
2022-08-08 12:38:04.156620149 [I:onnxruntime:, inference_session.cc:331 ConstructorCommon] Creating and using per session threadpools since use_per_session_threads_ is true
2022-08-08 12:38:04.156790358 [I:onnxruntime:, inference_session.cc:351 ConstructorCommon] Dynamic block base set to 0
2022-08-08 12:38:04.245297680 [I:onnxruntime:, inference_session.cc:1327 Initialize] Initializing session.
2022-08-08 12:38:04.255709583 [I:onnxruntime:, inference_session.cc:1364 Initialize] Adding default CPU execution provider.
2022-08-08 12:38:04.328995239 [I:onnxruntime:, reshape_fusion.cc:42 ApplyImpl] Total fused reshape node count: 0
2022-08-08 12:38:04.339136808 [I:onnxruntime:, reshape_fusion.cc:42 ApplyImpl] Total fused reshape node count: 0
...
2022-08-08 12:38:04.461602805 [V:onnxruntime:, inference_session.cc:150 VerifyEachNodeIsAssignedToAnEp] Node placements
2022-08-08 12:38:04.461805847 [V:onnxruntime:, inference_session.cc:152 VerifyEachNodeIsAssignedToAnEp] All nodes have been placed on [CPUExecutionProvider].
2022-08-08 12:38:04.465085231 [V:onnxruntime:, session_state.cc:68 CreateGraphInfo] SaveMLValueNameIndexMapping
2022-08-08 12:38:04.467233028 [V:onnxruntime:, session_state.cc:114 CreateGraphInfo] Done saving OrtValue mappings.
2022-08-08 12:38:04.472646876 [I:onnxruntime:, session_state_utils.cc:140 SaveInitializedTensors] Saving initialized tensors.
2022-08-08 12:38:04.500036908 [I:onnxruntime:, session_state_utils.cc:266 SaveInitializedTensors] Done saving initialized tensors
2022-08-08 12:38:04.567663966 [I:onnxruntime:, inference_session.cc:1576 Initialize] Session successfully initialized.
2022-08-08 12:38:04.569381262 [I:onnxruntime:, parallel_executor.cc:110 RunNodeAsync] Begin execution
2022-08-08 12:38:05.044396709 [I:onnxruntime:, parallel_executor.cc:110 RunNodeAsync] Begin execution
2022-08-08 12:38:05.049980599 [I:onnxruntime:, parallel_executor.cc:110 RunNodeAsync] Begin execution
2022-08-08 12:38:05.481864556 [I:onnxruntime:, parallel_executor.cc:110 RunNodeAsync] Begin execution
iteration:1,time_cost:0.437829
2022-08-08 12:38:05.488209698 [I:onnxruntime:, parallel_executor.cc:110 RunNodeAsync] Begin execution
2022-08-08 12:38:05.918691567 [I:onnxruntime:, parallel_executor.cc:110 RunNodeAsync] Begin execution
iteration:2,time_cost:0.436364
2022-08-08 12:38:05.924688166 [I:onnxruntime:, parallel_executor.cc:110 RunNodeAsync] Begin execution
2022-08-08 12:38:06.395984895 [I:onnxruntime:, parallel_executor.cc:110 RunNodeAsync] Begin execution
iteration:3,time_cost:0.476823
2022-08-08 12:38:06.401708703 [I:onnxruntime:, parallel_executor.cc:110 RunNodeAsync] Begin execution
2022-08-08 12:38:06.841845140 [I:onnxruntime:, parallel_executor.cc:110 RunNodeAsync] Begin execution
iteration:4,time_cost:0.44602
2022-08-08 12:38:06.847975989 [I:onnxruntime:, parallel_executor.cc:110 RunNodeAsync] Begin execution
2022-08-08 12:38:07.278073899 [I:onnxruntime:, parallel_executor.cc:110 RunNodeAsync] Begin execution
iteration:5,time_cost:0.435996
2022-08-08 12:38:07.284063790 [I:onnxruntime:, parallel_executor.cc:110 RunNodeAsync] Begin execution
2022-08-08 12:38:07.723888810 [I:onnxruntime:, parallel_executor.cc:110 RunNodeAsync] Begin execution
iteration:6,time_cost:0.445308
2022-08-08 12:38:07.729596784 [I:onnxruntime:, parallel_executor.cc:110 RunNodeAsync] Begin execution
2022-08-08 12:38:08.183826009 [I:onnxruntime:, parallel_executor.cc:110 RunNodeAsync] Begin execution
iteration:7,time_cost:0.45994
2022-08-08 12:38:08.190072401 [I:onnxruntime:, parallel_executor.cc:110 RunNodeAsync] Begin execution
2022-08-08 12:38:08.621355814 [I:onnxruntime:, parallel_executor.cc:110 RunNodeAsync] Begin execution
iteration:8,time_cost:0.438163
Session creation time cost: 0.413285 s
Total inference time cost: 3.57644 s
Total inference requests: 8
Average inference time cost: 447.055 ms
Total inference run time: 3.57796 s
Avg CPU usage: 97 %
Peak working set size: 33345536 bytes
Avg CPU usage:97
Peak working set size:33345536
Runs:8
Min Latency: 0.435996 s
Max Latency: 0.476823 s
P50 Latency: 0.445308 s
P90 Latency: 0.476823 s
P95 Latency: 0.476823 s
P99 Latency: 0.476823 s
P999 Latency: 0.476823 s
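The verbose output interleaves "Begin execution" log lines with "iteration:N,time_cost:X" markers, so the per-run latencies can be recovered from a saved console log with a simple regular expression. A minimal sketch, using a few of the iteration lines from the output above as sample input:

```python
import re

# Sample lines copied from the verbose (-v) console output above.
log_excerpt = """
iteration:1,time_cost:0.437829
iteration:2,time_cost:0.436364
iteration:3,time_cost:0.476823
"""

# Recover the per-run timings (seconds) from the "time_cost" markers.
times = [float(m) for m in re.findall(r"time_cost:([0-9.]+)", log_excerpt)]
print(f"runs: {len(times)}, min: {min(times)} s, max: {max(times)} s")
```

This is handy for post-processing a long benchmark log, for example to plot the latency of each run or to spot warm-up effects in the first iterations.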
3. References[edit source]