How to measure performance of your NN models using TensorFlow Lite runtime

Applicable for STM32MP13x lines, STM32MP15x lines

This article describes how to measure the performance of a TensorFlow Lite neural network model on an STM32MP1x platform.

1 Installation

1.1 Installing from the OpenSTLinux AI package repository

Warning: The software package is provided AS IS, and by downloading it, you agree to be bound by the terms of the software license agreement (SLA0048). The detailed content licenses can be found here.

After having configured the AI OpenSTLinux package repository, install the X-LINUX-AI components for this application. The minimum package required is:

 apt-get install tensorflow-lite-tools

The model used in this example can be installed from the following package:

 apt-get install tflite-models-mobilenetv1
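
As a quick check, you can verify that the benchmark tool and the model are installed as expected by listing their directories on the target. These are the installation paths used in the examples of this article:

 ls /usr/local/bin/tensorflow-lite-2.11.0/tools/
 ls /usr/local/demo-ai/computer-vision/models/mobilenet/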

2 How to use the Benchmark application

2.1 Executing with the command line

The benchmark_model C/C++ application is located in the userfs partition:

/usr/local/bin/tensorflow-lite-2.11.0/tools/benchmark_model

It accepts the following input parameters:

usage: ./benchmark_model <flags>

Flags:
	--num_runs=50                                  	int32	optional	expected number of runs, see also min_secs, max_secs
	--min_secs=1                                   	float	optional	minimum number of seconds to rerun for, potentially making the actual number of runs to be greater than num_runs
	--max_secs=150                                 	float	optional	maximum number of seconds to rerun for, potentially making the actual number of runs to be less than num_runs. Note if --max-secs is exceeded in the middle of a run, the benchmark will continue to the end of the run but will not start the next run.
	--run_delay=-1                                 	float	optional	delay between runs in seconds
	--run_frequency=-1                             	float	optional	Execute at a fixed frequency, instead of a fixed delay. Note if the targeted rate per second cannot be reached, the benchmark would start the next run immediately, trying its best to catch up. If set, this will override run_delay.
	--num_threads=-1                               	int32	optional	number of threads
	--use_caching=false                            	bool	optional	Enable caching of prepacked weights matrices in matrix multiplication routines. Currently implies the use of the Ruy library.
	--benchmark_name=                              	string	optional	benchmark name
	--output_prefix=                               	string	optional	benchmark output prefix
	--warmup_runs=1                                	int32	optional	minimum number of runs performed on initialization, to allow performance characteristics to settle, see also warmup_min_secs
	--warmup_min_secs=0.5                          	float	optional	minimum number of seconds to rerun for, potentially making the actual number of warm-up runs to be greater than warmup_runs
	--verbose=false                                	bool	optional	Whether to log parameters whose values are not set. By default, only log those parameters that are set by parsing their values from the commandline flags.
	--dry_run=false                                	bool	optional	Whether to run the tool just with simply loading the model, allocating tensors etc. but without actually invoking any op kernels.
	--report_peak_memory_footprint=false           	bool	optional	Report the peak memory footprint by periodically checking the memory footprint. Internally, a separate thread will be spawned for this periodic check. Therefore, the performance benchmark result could be affected.
	--memory_footprint_check_interval_ms=50        	int32	optional	The interval in millisecond between two consecutive memory footprint checks. This is only used when --report_peak_memory_footprint is set to true.
	--graph=                                       	string	optional	graph file name
	--input_layer=                                 	string	optional	input layer names
	--input_layer_shape=                           	string	optional	input layer shape
	--input_layer_value_range=                     	string	optional	A map-like string representing value range for *integer* input layers. Each item is separated by ':', and the item value consists of input layer name and integer-only range values (both low and high are inclusive) separated by ',', e.g. input1,1,2:input2,0,254
	--input_layer_value_files=                     	string	optional	A map-like string representing value file. Each item is separated by ',', and the item value consists of input layer name and value file path separated by ':', e.g. input1:file_path1,input2:file_path2. In case the input layer name contains ':' e.g. "input:0", escape it with "\:". If the input_name appears both in input_layer_value_range and input_layer_value_files, input_layer_value_range of the input_name will be ignored. The file format is binary and it should be array format or null separated strings format.
	--allow_fp16=false                             	bool	optional	allow fp16
	--require_full_delegation=false                	bool	optional	require delegate to run the entire graph
	--enable_op_profiling=false                    	bool	optional	enable op profiling
	--max_profiling_buffer_entries=1024            	int32	optional	max initial profiling buffer entries
	--allow_dynamic_profiling_buffer_increase=false	bool	optional	allow dynamic increase on profiling buffer entries
	--profiling_output_csv_file=                   	string	optional	File path to export profile data as CSV, if not set prints to stdout.
	--print_preinvoke_state=false                  	bool	optional	print out the interpreter internals just before calling Invoke. The internals will include allocated memory size of each tensor etc.
	--print_postinvoke_state=false                 	bool	optional	print out the interpreter internals just before benchmark completes (i.e. after all repeated Invoke calls complete). The internals will include allocated memory size of each tensor etc.
	--release_dynamic_tensors=false                	bool	optional	Ensure dynamic tensor's memory is released when they are not used.
	--optimize_memory_for_large_tensors=0          	int32	optional	Optimize memory usage for large tensors with sacrificing latency.
	--disable_delegate_clustering=false            	bool	optional	Disable delegate clustering.
	--output_filepath=                             	string	optional	File path to export outputs layer as binary data.
	--help=false                                   	bool	optional	Print out all supported flags if true.
	--num_threads=-1                               	int32	optional	number of threads used for inference on CPU.
	--max_delegated_partitions=0                   	int32	optional	Max number of partitions to be delegated.
	--min_nodes_per_partition=0                    	int32	optional	The minimal number of TFLite graph nodes of a partition that has to be reached for it to be delegated. A negative value or 0 means to use the default choice of each delegate.
	--delegate_serialize_dir=                      	string	optional	Directory to be used by delegates for serializing any model data. This allows the delegate to save data into this directory to reduce init time after the first run. Currently supported by NNAPI delegate with specific backends on Android. Note that delegate_serialize_token is also required to enable this feature.
	--delegate_serialize_token=                    	string	optional	Model-specific token acting as a namespace for delegate serialization. Unique tokens ensure that the delegate doesn't read inapplicable/invalid data. Note that delegate_serialize_dir is also required to enable this feature.
	--external_delegate_path=                      	string	optional	The library path for the underlying external.
	--external_delegate_options=                   	string	optional	A list of comma-separated options to be passed to the external delegate. Each option is a colon-separated key-value pair, e.g. option_name:option_value.
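
Several of these flags can be combined in a single invocation. The sketch below is hypothetical: model.tflite and the input layer name input are placeholders. It fixes the number of measured and warm-up runs and feeds random integer values in the range [0,255] to the input layer:

 ./benchmark_model --graph=model.tflite --num_runs=100 --warmup_runs=5 --input_layer=input --input_layer_value_range=input,0,255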

2.2 Testing with MobileNet V1

The model used for testing is mobilenet_v1_0.5_128_quant.tflite, downloaded from TensorFlow Hub[1]. It is a model used for image classification.
On the target, the model is located here:

/usr/local/demo-ai/computer-vision/models/mobilenet/

To launch the Benchmark application in its minimal configuration, use the following command:

 /usr/local/bin/tensorflow-lite-2.11.0/tools/benchmark_model --graph=/usr/local/demo-ai/computer-vision/models/mobilenet/mobilenet_v1_0.5_128_quant.tflite

Console output:

STARTING!
Log parameter values verbosely: [0]
Graph: [/usr/local/demo-ai/computer-vision/models/mobilenet/mobilenet_v1_0.5_128_quant.tflite]
Loaded model /usr/local/demo-ai/computer-vision/models/mobilenet/mobilenet_v1_0.5_128_quant.tflite
The input model file size (MB): 1.36451
Initialized session in 7.11ms.
Running benchmark for at least 1 iterations and at least 0.5 seconds but terminate if exceeding 150 seconds.
count=6 first=86950 curr=83078 min=82943 max=86950 avg=83733.5 std=1447

Running benchmark for at least 50 iterations and at least 1 seconds but terminate if exceeding 150 seconds.
count=50 first=83766 curr=87918 min=82910 max=87918 avg=83747.3 std=1083

Inference timings in us: Init: 7110, First inference: 86950, Warmup (avg): 83733.5, Inference (avg): 83747.3
Note: as the benchmark tool itself affects memory footprint, the following is only APPROXIMATE to the actual memory footprint of the model at runtime. Take the information at your discretion.
Peak memory footprint (MB): init=2.52734 overall=3.96094
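
For reference, the reported average inference time (avg=83747.3 µs, about 83.7 ms per inference) corresponds to roughly 12 inferences per second. This conversion can be done quickly on the target or a host PC, for example with awk:

 echo 83747.3 | awk '{printf "%.1f inferences/s\n", 1000000/$1}'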

To obtain the best performance, use the num_threads flag to run the benchmark on more than one thread, depending on the hardware used.

 /usr/local/bin/tensorflow-lite-2.11.0/tools/benchmark_model --graph=/usr/local/demo-ai/computer-vision/models/mobilenet/mobilenet_v1_0.5_128_quant.tflite --num_threads=2

Console output:

STARTING!
Log parameter values verbosely: [0]
Num threads: [2]
Graph: [/usr/local/demo-ai/computer-vision/models/mobilenet/mobilenet_v1_0.5_128_quant.tflite]
Loaded model /usr/local/demo-ai/computer-vision/models/mobilenet/mobilenet_v1_0.5_128_quant.tflite
The input model file size (MB): 1.36451
Initialized session in 6.484ms.
Running benchmark for at least 1 iterations and at least 0.5 seconds but terminate if exceeding 150 seconds.
count=12 first=49700 curr=43522 min=43522 max=50016 avg=45037.8 std=2285

Running benchmark for at least 50 iterations and at least 1 seconds but terminate if exceeding 150 seconds.
count=50 first=43819 curr=44488 min=43451 max=58290 avg=45438.3 std=3255

Inference timings in us: Init: 6484, First inference: 49700, Warmup (avg): 45037.8, Inference (avg): 45438.3
Note: as the benchmark tool itself affects memory footprint, the following is only APPROXIMATE to the actual memory footprint of the model at runtime. Take the information at your discretion.
Peak memory footprint (MB): init=2.52734 overall=4.17969
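
With two threads, the average inference time drops from 83747.3 µs to 45438.3 µs, a speedup of roughly 1.84x on this platform, at the cost of a slightly larger peak memory footprint (4.18 MB versus 3.96 MB).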

To display more information, use the verbose and enable_op_profiling flags.

 /usr/local/bin/tensorflow-lite-2.11.0/tools/benchmark_model --graph=/usr/local/demo-ai/computer-vision/models/mobilenet/mobilenet_v1_0.5_128_quant.tflite --enable_op_profiling=true --num_threads=2 --verbose=true

Console output:

STARTING!
Log parameter values verbosely: [1]
Min num runs: [50]
Min runs duration (seconds): [1]
Max runs duration (seconds): [150]
Inter-run delay (seconds): [-1]
Number of prorated runs per second: [-1]
Num threads: [2]
Use caching: [0]
Benchmark name: []
Output prefix: []
Min warmup runs: [1]
Min warmup runs duration (seconds): [0.5]
Graph: [/usr/local/demo-ai/computer-vision/models/mobilenet/mobilenet_v1_0.5_128_quant.tflite]
Input layers: []
Input shapes: []
Input value ranges: []
Input value files: []
Allow fp16: [0]
Require full delegation: [0]
Enable op profiling: [1]
Max profiling buffer entries: [1024]
CSV File to export profiling data to: []
Print pre-invoke interpreter state: [0]
Print post-invoke interpreter state: [0]
Loaded model /usr/local/demo-ai/computer-vision/models/mobilenet/mobilenet_v1_0.5_128_quant.tflite
The input model file size (MB): 1.36451
Initialized session in 7.048ms.
Running benchmark for at least 1 iterations and at least 0.5 seconds but terminate if exceeding 150 seconds.
count=12 first=47714 curr=44373 min=43478 max=47714 avg=44048.9 std=1131

Running benchmark for at least 50 iterations and at least 1 seconds but terminate if exceeding 150 seconds.
count=50 first=45097 curr=44107 min=43532 max=58039 avg=45254.1 std=3243

Inference timings in us: Init: 7048, First inference: 47714, Warmup (avg): 44048.9, Inference (avg): 45254.1
Note: as the benchmark tool itself affects memory footprint, the following is only APPROXIMATE to the actual memory footprint of the model at runtime. Take the information at your discretion.
Peak memory footprint (MB): init=2.62109 overall=4.20703
Profiling Info for Benchmark Initialization:
============================== Run Order ==============================
	             [node type]	          [start]	  [first]	 [avg ms]	     [%]	  [cdf%]	  [mem KB]	[times called]	[Name]
	         AllocateTensors	            0.000	    2.594	    2.594	100.000%	100.000%	   124.000	        1	AllocateTensors/0

============================== Top by Computation Time ==============================
	             [node type]	          [start]	  [first]	 [avg ms]	     [%]	  [cdf%]	  [mem KB]	[times called]	[Name]
	         AllocateTensors	            0.000	    2.594	    2.594	100.000%	100.000%	   124.000	        1	AllocateTensors/0

Number of nodes executed: 1
============================== Summary by node type ==============================
	             [Node type]	  [count]	  [avg ms]	    [avg %]	    [cdf %]	  [mem KB]	[times called]
	         AllocateTensors	        1	     2.594	   100.000%	   100.000%	   124.000	        1

Timings (microseconds): count=1 curr=2594
Memory (bytes): count=0
1 nodes observed



Operator-wise Profiling Info for Regular Benchmark Runs:
============================== Run Order ==============================
	             [node type]	          [start]	  [first]	 [avg ms]	     [%]	  [cdf%]	  [mem KB]	[times called]	[Name]
	                 CONV_2D	            0.034	    4.257	    3.922	  8.702%	  8.702%	     0.000	        1	[MobilenetV1/MobilenetV1/Conv2d_0/Relu6]:0
	       DEPTHWISE_CONV_2D	            3.962	    1.545	    1.802	  3.997%	 12.700%	     0.000	        1	[MobilenetV1/MobilenetV1/Conv2d_1_depthwise/Relu6]:1
	                 CONV_2D	            5.766	    4.550	    4.562	 10.122%	 22.821%	     0.000	        1	[MobilenetV1/MobilenetV1/Conv2d_1_pointwise/Relu6]:2
	       DEPTHWISE_CONV_2D	           10.334	    1.093	    1.073	  2.381%	 25.202%	     0.000	        1	[MobilenetV1/MobilenetV1/Conv2d_2_depthwise/Relu6]:3
	                 CONV_2D	           11.410	    2.508	    2.798	  6.207%	 31.409%	     0.000	        1	[MobilenetV1/MobilenetV1/Conv2d_2_pointwise/Relu6]:4
	       DEPTHWISE_CONV_2D	           14.213	    1.895	    1.827	  4.053%	 35.462%	     0.000	        1	[MobilenetV1/MobilenetV1/Conv2d_3_depthwise/Relu6]:5
	                 CONV_2D	           16.043	    3.550	    3.551	  7.878%	 43.340%	     0.000	        1	[MobilenetV1/MobilenetV1/Conv2d_3_pointwise/Relu6]:6
	       DEPTHWISE_CONV_2D	           19.599	    0.476	    0.518	  1.149%	 44.489%	     0.000	        1	[MobilenetV1/MobilenetV1/Conv2d_4_depthwise/Relu6]:7
	                 CONV_2D	           20.120	    1.623	    1.673	  3.711%	 48.201%	     0.000	        1	[MobilenetV1/MobilenetV1/Conv2d_4_pointwise/Relu6]:8
	       DEPTHWISE_CONV_2D	           21.796	    1.035	    0.841	  1.866%	 50.067%	     0.000	        1	[MobilenetV1/MobilenetV1/Conv2d_5_depthwise/Relu6]:9
	                 CONV_2D	           22.639	    2.491	    2.543	  5.642%	 55.709%	     0.000	        1	[MobilenetV1/MobilenetV1/Conv2d_5_pointwise/Relu6]:10
	       DEPTHWISE_CONV_2D	           25.186	    0.234	    0.240	  0.532%	 56.241%	     0.000	        1	[MobilenetV1/MobilenetV1/Conv2d_6_depthwise/Relu6]:11
	                 CONV_2D	           25.428	    1.242	    1.315	  2.917%	 59.158%	     0.000	        1	[MobilenetV1/MobilenetV1/Conv2d_6_pointwise/Relu6]:12
	       DEPTHWISE_CONV_2D	           26.745	    0.421	    0.434	  0.964%	 60.121%	     0.000	        1	[MobilenetV1/MobilenetV1/Conv2d_7_depthwise/Relu6]:13
	                 CONV_2D	           27.182	    2.385	    2.179	  4.834%	 64.956%	     0.000	        1	[MobilenetV1/MobilenetV1/Conv2d_7_pointwise/Relu6]:14
	       DEPTHWISE_CONV_2D	           29.364	    0.410	    0.405	  0.898%	 65.854%	     0.000	        1	[MobilenetV1/MobilenetV1/Conv2d_8_depthwise/Relu6]:15
	                 CONV_2D	           29.771	    2.133	    2.181	  4.838%	 70.692%	     0.000	        1	[MobilenetV1/MobilenetV1/Conv2d_8_pointwise/Relu6]:16
	       DEPTHWISE_CONV_2D	           31.955	    0.417	    0.422	  0.936%	 71.628%	     0.000	        1	[MobilenetV1/MobilenetV1/Conv2d_9_depthwise/Relu6]:17
	                 CONV_2D	           32.378	    2.307	    2.242	  4.974%	 76.603%	     0.000	        1	[MobilenetV1/MobilenetV1/Conv2d_9_pointwise/Relu6]:18
	       DEPTHWISE_CONV_2D	           34.623	    0.421	    0.475	  1.055%	 77.658%	     0.000	        1	[MobilenetV1/MobilenetV1/Conv2d_10_depthwise/Relu6]:19
	                 CONV_2D	           35.101	    2.128	    2.193	  4.865%	 82.523%	     0.000	        1	[MobilenetV1/MobilenetV1/Conv2d_10_pointwise/Relu6]:20
	       DEPTHWISE_CONV_2D	           37.297	    0.414	    0.407	  0.903%	 83.426%	     0.000	        1	[MobilenetV1/MobilenetV1/Conv2d_11_depthwise/Relu6]:21
	                 CONV_2D	           37.706	    2.355	    2.157	  4.786%	 88.212%	     0.000	        1	[MobilenetV1/MobilenetV1/Conv2d_11_pointwise/Relu6]:22
	       DEPTHWISE_CONV_2D	           39.866	    0.156	    0.132	  0.292%	 88.504%	     0.000	        1	[MobilenetV1/MobilenetV1/Conv2d_12_depthwise/Relu6]:23
	                 CONV_2D	           40.000	    1.263	    1.277	  2.833%	 91.337%	     0.000	        1	[MobilenetV1/MobilenetV1/Conv2d_12_pointwise/Relu6]:24
	       DEPTHWISE_CONV_2D	           41.281	    0.211	    0.195	  0.433%	 91.770%	     0.000	        1	[MobilenetV1/MobilenetV1/Conv2d_13_depthwise/Relu6]:25
	                 CONV_2D	           41.477	    2.384	    2.486	  5.516%	 97.285%	     0.000	        1	[MobilenetV1/MobilenetV1/Conv2d_13_pointwise/Relu6]:26
	         AVERAGE_POOL_2D	           43.968	    0.045	    0.051	  0.113%	 97.399%	     0.000	        1	[MobilenetV1/Logits/AvgPool_1a/AvgPool]:27
	                 CONV_2D	           44.021	    0.858	    1.112	  2.468%	 99.867%	     0.000	        1	[MobilenetV1/Logits/Conv2d_1c_1x1/BiasAdd]:28
	                 RESHAPE	           45.137	    0.008	    0.009	  0.019%	 99.886%	     0.000	        1	[MobilenetV1/Logits/SpatialSqueeze]:29
	                 SOFTMAX	           45.147	    0.050	    0.051	  0.114%	100.000%	     0.000	        1	[MobilenetV1/Predictions/Reshape_1]:30

============================== Top by Computation Time ==============================
	             [node type]	          [start]	  [first]	 [avg ms]	     [%]	  [cdf%]	  [mem KB]	[times called]	[Name]
	                 CONV_2D	            5.766	    4.550	    4.562	 10.122%	 10.122%	     0.000	        1	[MobilenetV1/MobilenetV1/Conv2d_1_pointwise/Relu6]:2
	                 CONV_2D	            0.034	    4.257	    3.922	  8.702%	 18.824%	     0.000	        1	[MobilenetV1/MobilenetV1/Conv2d_0/Relu6]:0
	                 CONV_2D	           16.043	    3.550	    3.551	  7.878%	 26.702%	     0.000	        1	[MobilenetV1/MobilenetV1/Conv2d_3_pointwise/Relu6]:6
	                 CONV_2D	           11.410	    2.508	    2.798	  6.207%	 32.909%	     0.000	        1	[MobilenetV1/MobilenetV1/Conv2d_2_pointwise/Relu6]:4
	                 CONV_2D	           22.639	    2.491	    2.543	  5.642%	 38.551%	     0.000	        1	[MobilenetV1/MobilenetV1/Conv2d_5_pointwise/Relu6]:10
	                 CONV_2D	           41.477	    2.384	    2.486	  5.516%	 44.067%	     0.000	        1	[MobilenetV1/MobilenetV1/Conv2d_13_pointwise/Relu6]:26
	                 CONV_2D	           32.378	    2.307	    2.242	  4.974%	 49.042%	     0.000	        1	[MobilenetV1/MobilenetV1/Conv2d_9_pointwise/Relu6]:18
	                 CONV_2D	           35.101	    2.128	    2.193	  4.865%	 53.907%	     0.000	        1	[MobilenetV1/MobilenetV1/Conv2d_10_pointwise/Relu6]:20
	                 CONV_2D	           29.771	    2.133	    2.181	  4.838%	 58.745%	     0.000	        1	[MobilenetV1/MobilenetV1/Conv2d_8_pointwise/Relu6]:16
	                 CONV_2D	           27.182	    2.385	    2.179	  4.834%	 63.579%	     0.000	        1	[MobilenetV1/MobilenetV1/Conv2d_7_pointwise/Relu6]:14

Number of nodes executed: 31
============================== Summary by node type ==============================
	             [Node type]	  [count]	  [avg ms]	    [avg %]	    [cdf %]	  [mem KB]	[times called]
	                 CONV_2D	       15	    36.182	    80.306%	    80.306%	     0.000	       15
	       DEPTHWISE_CONV_2D	       13	     8.763	    19.450%	    99.756%	     0.000	       13
	                 SOFTMAX	        1	     0.051	     0.113%	    99.869%	     0.000	        1
	         AVERAGE_POOL_2D	        1	     0.051	     0.113%	    99.982%	     0.000	        1
	                 RESHAPE	        1	     0.008	     0.018%	   100.000%	     0.000	        1

Timings (microseconds): count=50 first=44865 curr=43927 min=43359 max=57839 avg=45071.9 std=3241
Memory (bytes): count=0
31 nodes observed
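
In this profile, the CONV_2D nodes account for about 80% of the total inference time, which is expected for a MobileNet V1 network dominated by pointwise convolutions. To post-process these figures on a host PC, the profiling data can also be exported to a CSV file with the profiling_output_csv_file flag (the output path below is just an example):

 /usr/local/bin/tensorflow-lite-2.11.0/tools/benchmark_model --graph=/usr/local/demo-ai/computer-vision/models/mobilenet/mobilenet_v1_0.5_128_quant.tflite --enable_op_profiling=true --profiling_output_csv_file=/tmp/profiling.csv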

3 References