How to measure performance of your NN models using TensorFlow Lite runtime

This article describes how to measure the performance of a TensorFlow Lite neural network model on the STM32MP1x platform.

1 Installation

1.1 Installing from the OpenSTLinux AI package repository

Warning
The software package is provided AS IS, and by downloading it, you agree to be bound to the terms of the software license agreement (SLA). The detailed content licenses can be found here.

After configuring the AI OpenSTLinux package, install the X-LINUX-AI components for this application. The minimum package required is:

Board $> apt-get install tensorflow-lite-tools

The model used in this example can be installed from the following package:

Board $> apt-get install tflite-models-mobilenetv1

2 How to use the Benchmark application

2.1 Executing with the command line

The benchmark_model C/C++ application is located in the userfs partition:

/usr/local/bin/tensorflow-lite-x.x.x/tools/benchmark_model

It accepts the following input parameters:

usage: ./benchmark_model <flags>

Flags:
	--graph=                           	string	optional	graph file name
	--print_postinvoke_state=false     	bool	optional	print out the interpreter internals just before benchmark completes (i.e. after all repeated Invoke calls complete). The internals will include allocated memory size of each tensor etc.
	--print_preinvoke_state=false      	bool	optional	print out the interpreter internals just before calling Invoke. The internals will include allocated memory size of each tensor etc.
	--profiling_output_csv_file=       	string	optional	File path to export profile data as CSV, if not set prints to stdout.
	--max_profiling_buffer_entries=1024	int32	optional	max profiling buffer entries
	--enable_op_profiling=false        	bool	optional	enable op profiling
	--require_full_delegation=false    	bool	optional	require delegate to run the entire graph
	--allow_fp16=false                 	bool	optional	allow fp16
	--input_layer_value_files=         	string	optional	A map-like string representing value file. Each item is separated by ',', and the item value consists of input layer name and value file path separated by ':', e.g. input1:file_path1,input2:file_path2. If the input_name appears both in input_layer_value_range and input_layer_value_files, input_layer_value_range of the input_name will be ignored. The file format is binary and it should be array format or null separated strings format.
	--input_layer_value_range=         	string	optional	A map-like string representing value range for *integer* input layers. Each item is separated by ':', and the item value consists of input layer name and integer-only range values (both low and high are inclusive) separated by ',', e.g. input1,1,2:input2,0,254
	--input_layer_shape=               	string	optional	input layer shape
	--input_layer=                     	string	optional	input layer names
	--num_runs=50                      	int32	optional	expected number of runs, see also min_secs, max_secs
	--verbose=false                    	bool	optional	Whether to log parameters whose values are not set. By default, only log those parameters that are set by parsing their values from the commandline flags.
	--warmup_min_secs=0.5              	float	optional	minimum number of seconds to rerun for, potentially making the actual number of warm-up runs to be greater than warmup_runs
	--warmup_runs=1                    	int32	optional	minimum number of runs performed on initialization, to allow performance characteristics to settle, see also warmup_min_secs
	--output_prefix=                   	string	optional	benchmark output prefix
	--benchmark_name=                  	string	optional	benchmark name
	--use_caching=false                	bool	optional	Enable caching of prepacked weights matrices in matrix multiplication routines. Currently implies the use of the Ruy library.
	--num_threads=1                    	int32	optional	number of threads
	--run_frequency=-1                 	float	optional	Execute at a fixed frequency, instead of a fixed delay.Note if the targeted rate per second cannot be reached, the benchmark would start the next run immediately, trying its best to catch up. If set, this will override run_delay.
	--run_delay=-1                     	float	optional	delay between runs in seconds
	--max_secs=150                     	float	optional	maximum number of seconds to rerun for, potentially making the actual number of runs to be less than num_runs. Note if --max-secs is exceeded in the middle of a run, the benchmark will continue to the end of the run but will not start the next run.
	--min_secs=1                       	float	optional	minimum number of seconds to rerun for, potentially making the actual number of runs to be greater than num_runs
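
Since every flag follows the same --name=value pattern, a command line can be assembled programmatically when the benchmark is driven from a script. The following is a minimal Python sketch, not part of the X-LINUX-AI package: the helper name build_benchmark_command is hypothetical, the binary path assumes the 2.5.0 install directory used later in this article, and only flags listed above are used.

```python
import shlex

# Path as used later in this article; adjust the version directory to your install.
BENCHMARK_BIN = "/usr/local/bin/tensorflow-lite-2.5.0/tools/benchmark_model"
MODEL = (
    "/usr/local/demo-ai/computer-vision/models/mobilenet/"
    "mobilenet_v1_0.5_128_quant.tflite"
)


def build_benchmark_command(binary, flags):
    """Return an argv list: the binary followed by one --name=value per flag.

    Hypothetical helper for illustration only.
    """
    argv = [binary]
    for name, value in flags.items():
        # The help text shows booleans as true/false, so mirror that spelling.
        if isinstance(value, bool):
            value = "true" if value else "false"
        argv.append(f"--{name}={value}")
    return argv


cmd = build_benchmark_command(
    BENCHMARK_BIN,
    {"graph": MODEL, "num_threads": 2, "enable_op_profiling": True},
)
# Print a shell-ready command line; pass `cmd` to subprocess.run() to execute it.
print(shlex.join(cmd))
```

Returning an argv list rather than a single string keeps the result usable with subprocess.run() without shell quoting issues.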

2.2 Testing with MobileNet V1

The model used for testing is mobilenet_v1_0.5_128_quant.tflite, downloaded from the TensorFlow Lite hosted models[1]. It is an image classification model.
On the target, the model is located here:

/usr/local/demo-ai/computer-vision/models/mobilenet/

To launch the Benchmark application in its minimal configuration, use the following command:

Board $> /usr/local/bin/tensorflow-lite-2.5.0/tools/benchmark_model --graph=/usr/local/demo-ai/computer-vision/models/mobilenet/mobilenet_v1_0.5_128_quant.tflite

Console output:

STARTING!
Log parameter values verbosely: [0]
Graph: [/usr/local/demo-ai/computer-vision/models/mobilenet/mobilenet_v1_0.5_128_quant.tflite]
Loaded model /usr/local/demo-ai/computer-vision/models/mobilenet/mobilenet_v1_0.5_128_quant.tflite
The input model file size (MB): 1.36451
Initialized session in 7.11ms.
Running benchmark for at least 1 iterations and at least 0.5 seconds but terminate if exceeding 150 seconds.
count=6 first=86950 curr=83078 min=82943 max=86950 avg=83733.5 std=1447

Running benchmark for at least 50 iterations and at least 1 seconds but terminate if exceeding 150 seconds.
count=50 first=83766 curr=87918 min=82910 max=87918 avg=83747.3 std=1083

Inference timings in us: Init: 7110, First inference: 86950, Warmup (avg): 83733.5, Inference (avg): 83747.3
Note: as the benchmark tool itself affects memory footprint, the following is only APPROXIMATE to the actual memory footprint of the model at runtime. Take the information at your discretion.
Peak memory footprint (MB): init=2.52734 overall=3.96094
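
When the benchmark is run repeatedly, it can be convenient to extract the figures from the "Inference timings in us" line automatically. The sketch below is one possible way to do this in Python; the function name parse_inference_timings is hypothetical, and the parsing assumes the line layout shown in the console output above, which may differ in other TensorFlow Lite versions.

```python
# Summary line copied from the console output above.
SUMMARY_LINE = (
    "Inference timings in us: Init: 7110, First inference: 86950, "
    "Warmup (avg): 83733.5, Inference (avg): 83747.3"
)


def parse_inference_timings(line):
    """Map each label in the summary line to its value in microseconds."""
    prefix = "Inference timings in us:"
    if not line.startswith(prefix):
        raise ValueError("not an inference timings line")
    timings = {}
    for item in line[len(prefix):].split(","):
        # Split on the last colon so that the label keeps any inner text.
        label, _, value = item.rpartition(":")
        timings[label.strip()] = float(value)
    return timings


timings_us = parse_inference_timings(SUMMARY_LINE)
avg_ms = timings_us["Inference (avg)"] / 1000.0
print(f"average inference: {avg_ms:.1f} ms")
```

With the values above, the average inference time is about 83.7 ms per image on a single thread.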

For best performance, use the num_threads flag to run the benchmark on more than one thread, depending on the hardware used:

Board $> /usr/local/bin/tensorflow-lite-2.5.0/tools/benchmark_model --graph=/usr/local/demo-ai/computer-vision/models/mobilenet/mobilenet_v1_0.5_128_quant.tflite --num_threads=2

Console output:

STARTING!
Log parameter values verbosely: [0]
Num threads: [2]
Graph: [/usr/local/demo-ai/computer-vision/models/mobilenet/mobilenet_v1_0.5_128_quant.tflite]
Loaded model /usr/local/demo-ai/computer-vision/models/mobilenet/mobilenet_v1_0.5_128_quant.tflite
The input model file size (MB): 1.36451
Initialized session in 6.484ms.
Running benchmark for at least 1 iterations and at least 0.5 seconds but terminate if exceeding 150 seconds.
count=12 first=49700 curr=43522 min=43522 max=50016 avg=45037.8 std=2285

Running benchmark for at least 50 iterations and at least 1 seconds but terminate if exceeding 150 seconds.
count=50 first=43819 curr=44488 min=43451 max=58290 avg=45438.3 std=3255

Inference timings in us: Init: 6484, First inference: 49700, Warmup (avg): 45037.8, Inference (avg): 45438.3
Note: as the benchmark tool itself affects memory footprint, the following is only APPROXIMATE to the actual memory footprint of the model at runtime. Take the information at your discretion.
Peak memory footprint (MB): init=2.52734 overall=4.17969
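
To quantify the gain from the second thread, the two "Inference (avg)" values can be compared directly. The short Python calculation below is an illustrative sketch using only the figures from the two console outputs above; the variable and function names are chosen for this example.

```python
# Average inference times in microseconds, copied from the two runs above.
avg_us_1_thread = 83747.3
avg_us_2_threads = 45438.3


def inferences_per_second(avg_us):
    """Convert an average latency in microseconds to inferences per second."""
    return 1_000_000 / avg_us


speedup = avg_us_1_thread / avg_us_2_threads
efficiency = speedup / 2  # fraction of ideal linear scaling with 2 threads

print(f"1 thread : {inferences_per_second(avg_us_1_thread):.1f} inferences/s")
print(f"2 threads: {inferences_per_second(avg_us_2_threads):.1f} inferences/s")
print(f"speedup  : {speedup:.2f}x ({efficiency:.0%} of linear scaling)")
```

For these two runs the speedup is roughly 1.84x, that is, from about 11.9 to about 22.0 inferences per second. The peak memory footprint also rises slightly, from 3.96 MB to 4.18 MB overall.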

To display more information, add the verbose and enable_op_profiling flags:

Board $> /usr/local/bin/tensorflow-lite-2.5.0/tools/benchmark_model --graph=/usr/local/demo-ai/computer-vision/models/mobilenet/mobilenet_v1_0.5_128_quant.tflite --enable_op_profiling=true --num_threads=2 --verbose=true

Console output:

STARTING!
Log parameter values verbosely: [1]
Min num runs: [50]
Min runs duration (seconds): [1]
Max runs duration (seconds): [150]
Inter-run delay (seconds): [-1]
Number of prorated runs per second: [-1]
Num threads: [2]
Use caching: [0]
Benchmark name: []
Output prefix: []
Min warmup runs: [1]
Min warmup runs duration (seconds): [0.5]
Graph: [/usr/local/demo-ai/computer-vision/models/mobilenet/mobilenet_v1_0.5_128_quant.tflite]
Input layers: []
Input shapes: []
Input value ranges: []
Input value files: []
Allow fp16: [0]
Require full delegation: [0]
Enable op profiling: [1]
Max profiling buffer entries: [1024]
CSV File to export profiling data to: []
Print pre-invoke interpreter state: [0]
Print post-invoke interpreter state: [0]
Loaded model /usr/local/demo-ai/computer-vision/models/mobilenet/mobilenet_v1_0.5_128_quant.tflite
The input model file size (MB): 1.36451
Initialized session in 7.048ms.
Running benchmark for at least 1 iterations and at least 0.5 seconds but terminate if exceeding 150 seconds.
count=12 first=47714 curr=44373 min=43478 max=47714 avg=44048.9 std=1131

Running benchmark for at least 50 iterations and at least 1 seconds but terminate if exceeding 150 seconds.
count=50 first=45097 curr=44107 min=43532 max=58039 avg=45254.1 std=3243

Inference timings in us: Init: 7048, First inference: 47714, Warmup (avg): 44048.9, Inference (avg): 45254.1
Note: as the benchmark tool itself affects memory footprint, the following is only APPROXIMATE to the actual memory footprint of the model at runtime. Take the information at your discretion.
Peak memory footprint (MB): init=2.62109 overall=4.20703
Profiling Info for Benchmark Initialization:
============================== Run Order ==============================
	             [node type]	          [start]	  [first]	 [avg ms]	     [%]	  [cdf%]	  [mem KB]	[times called]	[Name]
	         AllocateTensors	            0.000	    2.594	    2.594	100.000%	100.000%	   124.000	        1	AllocateTensors/0

============================== Top by Computation Time ==============================
	             [node type]	          [start]	  [first]	 [avg ms]	     [%]	  [cdf%]	  [mem KB]	[times called]	[Name]
	         AllocateTensors	            0.000	    2.594	    2.594	100.000%	100.000%	   124.000	        1	AllocateTensors/0

Number of nodes executed: 1
============================== Summary by node type ==============================
	             [Node type]	  [count]	  [avg ms]	    [avg %]	    [cdf %]	  [mem KB]	[times called]
	         AllocateTensors	        1	     2.594	   100.000%	   100.000%	   124.000	        1

Timings (microseconds): count=1 curr=2594
Memory (bytes): count=0
1 nodes observed



Operator-wise Profiling Info for Regular Benchmark Runs:
============================== Run Order ==============================
	             [node type]	          [start]	  [first]	 [avg ms]	     [%]	  [cdf%]	  [mem KB]	[times called]	[Name]
	                 CONV_2D	            0.034	    4.257	    3.922	  8.702%	  8.702%	     0.000	        1	[MobilenetV1/MobilenetV1/Conv2d_0/Relu6]:0
	       DEPTHWISE_CONV_2D	            3.962	    1.545	    1.802	  3.997%	 12.700%	     0.000	        1	[MobilenetV1/MobilenetV1/Conv2d_1_depthwise/Relu6]:1
	                 CONV_2D	            5.766	    4.550	    4.562	 10.122%	 22.821%	     0.000	        1	[MobilenetV1/MobilenetV1/Conv2d_1_pointwise/Relu6]:2
	       DEPTHWISE_CONV_2D	           10.334	    1.093	    1.073	  2.381%	 25.202%	     0.000	        1	[MobilenetV1/MobilenetV1/Conv2d_2_depthwise/Relu6]:3
	                 CONV_2D	           11.410	    2.508	    2.798	  6.207%	 31.409%	     0.000	        1	[MobilenetV1/MobilenetV1/Conv2d_2_pointwise/Relu6]:4
	       DEPTHWISE_CONV_2D	           14.213	    1.895	    1.827	  4.053%	 35.462%	     0.000	        1	[MobilenetV1/MobilenetV1/Conv2d_3_depthwise/Relu6]:5
	                 CONV_2D	           16.043	    3.550	    3.551	  7.878%	 43.340%	     0.000	        1	[MobilenetV1/MobilenetV1/Conv2d_3_pointwise/Relu6]:6
	       DEPTHWISE_CONV_2D	           19.599	    0.476	    0.518	  1.149%	 44.489%	     0.000	        1	[MobilenetV1/MobilenetV1/Conv2d_4_depthwise/Relu6]:7
	                 CONV_2D	           20.120	    1.623	    1.673	  3.711%	 48.201%	     0.000	        1	[MobilenetV1/MobilenetV1/Conv2d_4_pointwise/Relu6]:8
	       DEPTHWISE_CONV_2D	           21.796	    1.035	    0.841	  1.866%	 50.067%	     0.000	        1	[MobilenetV1/MobilenetV1/Conv2d_5_depthwise/Relu6]:9
	                 CONV_2D	           22.639	    2.491	    2.543	  5.642%	 55.709%	     0.000	        1	[MobilenetV1/MobilenetV1/Conv2d_5_pointwise/Relu6]:10
	       DEPTHWISE_CONV_2D	           25.186	    0.234	    0.240	  0.532%	 56.241%	     0.000	        1	[MobilenetV1/MobilenetV1/Conv2d_6_depthwise/Relu6]:11
	                 CONV_2D	           25.428	    1.242	    1.315	  2.917%	 59.158%	     0.000	        1	[MobilenetV1/MobilenetV1/Conv2d_6_pointwise/Relu6]:12
	       DEPTHWISE_CONV_2D	           26.745	    0.421	    0.434	  0.964%	 60.121%	     0.000	        1	[MobilenetV1/MobilenetV1/Conv2d_7_depthwise/Relu6]:13
	                 CONV_2D	           27.182	    2.385	    2.179	  4.834%	 64.956%	     0.000	        1	[MobilenetV1/MobilenetV1/Conv2d_7_pointwise/Relu6]:14
	       DEPTHWISE_CONV_2D	           29.364	    0.410	    0.405	  0.898%	 65.854%	     0.000	        1	[MobilenetV1/MobilenetV1/Conv2d_8_depthwise/Relu6]:15
	                 CONV_2D	           29.771	    2.133	    2.181	  4.838%	 70.692%	     0.000	        1	[MobilenetV1/MobilenetV1/Conv2d_8_pointwise/Relu6]:16
	       DEPTHWISE_CONV_2D	           31.955	    0.417	    0.422	  0.936%	 71.628%	     0.000	        1	[MobilenetV1/MobilenetV1/Conv2d_9_depthwise/Relu6]:17
	                 CONV_2D	           32.378	    2.307	    2.242	  4.974%	 76.603%	     0.000	        1	[MobilenetV1/MobilenetV1/Conv2d_9_pointwise/Relu6]:18
	       DEPTHWISE_CONV_2D	           34.623	    0.421	    0.475	  1.055%	 77.658%	     0.000	        1	[MobilenetV1/MobilenetV1/Conv2d_10_depthwise/Relu6]:19
	                 CONV_2D	           35.101	    2.128	    2.193	  4.865%	 82.523%	     0.000	        1	[MobilenetV1/MobilenetV1/Conv2d_10_pointwise/Relu6]:20
	       DEPTHWISE_CONV_2D	           37.297	    0.414	    0.407	  0.903%	 83.426%	     0.000	        1	[MobilenetV1/MobilenetV1/Conv2d_11_depthwise/Relu6]:21
	                 CONV_2D	           37.706	    2.355	    2.157	  4.786%	 88.212%	     0.000	        1	[MobilenetV1/MobilenetV1/Conv2d_11_pointwise/Relu6]:22
	       DEPTHWISE_CONV_2D	           39.866	    0.156	    0.132	  0.292%	 88.504%	     0.000	        1	[MobilenetV1/MobilenetV1/Conv2d_12_depthwise/Relu6]:23
	                 CONV_2D	           40.000	    1.263	    1.277	  2.833%	 91.337%	     0.000	        1	[MobilenetV1/MobilenetV1/Conv2d_12_pointwise/Relu6]:24
	       DEPTHWISE_CONV_2D	           41.281	    0.211	    0.195	  0.433%	 91.770%	     0.000	        1	[MobilenetV1/MobilenetV1/Conv2d_13_depthwise/Relu6]:25
	                 CONV_2D	           41.477	    2.384	    2.486	  5.516%	 97.285%	     0.000	        1	[MobilenetV1/MobilenetV1/Conv2d_13_pointwise/Relu6]:26
	         AVERAGE_POOL_2D	           43.968	    0.045	    0.051	  0.113%	 97.399%	     0.000	        1	[MobilenetV1/Logits/AvgPool_1a/AvgPool]:27
	                 CONV_2D	           44.021	    0.858	    1.112	  2.468%	 99.867%	     0.000	        1	[MobilenetV1/Logits/Conv2d_1c_1x1/BiasAdd]:28
	                 RESHAPE	           45.137	    0.008	    0.009	  0.019%	 99.886%	     0.000	        1	[MobilenetV1/Logits/SpatialSqueeze]:29
	                 SOFTMAX	           45.147	    0.050	    0.051	  0.114%	100.000%	     0.000	        1	[MobilenetV1/Predictions/Reshape_1]:30

============================== Top by Computation Time ==============================
	             [node type]	          [start]	  [first]	 [avg ms]	     [%]	  [cdf%]	  [mem KB]	[times called]	[Name]
	                 CONV_2D	            5.766	    4.550	    4.562	 10.122%	 10.122%	     0.000	        1	[MobilenetV1/MobilenetV1/Conv2d_1_pointwise/Relu6]:2
	                 CONV_2D	            0.034	    4.257	    3.922	  8.702%	 18.824%	     0.000	        1	[MobilenetV1/MobilenetV1/Conv2d_0/Relu6]:0
	                 CONV_2D	           16.043	    3.550	    3.551	  7.878%	 26.702%	     0.000	        1	[MobilenetV1/MobilenetV1/Conv2d_3_pointwise/Relu6]:6
	                 CONV_2D	           11.410	    2.508	    2.798	  6.207%	 32.909%	     0.000	        1	[MobilenetV1/MobilenetV1/Conv2d_2_pointwise/Relu6]:4
	                 CONV_2D	           22.639	    2.491	    2.543	  5.642%	 38.551%	     0.000	        1	[MobilenetV1/MobilenetV1/Conv2d_5_pointwise/Relu6]:10
	                 CONV_2D	           41.477	    2.384	    2.486	  5.516%	 44.067%	     0.000	        1	[MobilenetV1/MobilenetV1/Conv2d_13_pointwise/Relu6]:26
	                 CONV_2D	           32.378	    2.307	    2.242	  4.974%	 49.042%	     0.000	        1	[MobilenetV1/MobilenetV1/Conv2d_9_pointwise/Relu6]:18
	                 CONV_2D	           35.101	    2.128	    2.193	  4.865%	 53.907%	     0.000	        1	[MobilenetV1/MobilenetV1/Conv2d_10_pointwise/Relu6]:20
	                 CONV_2D	           29.771	    2.133	    2.181	  4.838%	 58.745%	     0.000	        1	[MobilenetV1/MobilenetV1/Conv2d_8_pointwise/Relu6]:16
	                 CONV_2D	           27.182	    2.385	    2.179	  4.834%	 63.579%	     0.000	        1	[MobilenetV1/MobilenetV1/Conv2d_7_pointwise/Relu6]:14

Number of nodes executed: 31
============================== Summary by node type ==============================
	             [Node type]	  [count]	  [avg ms]	    [avg %]	    [cdf %]	  [mem KB]	[times called]
	                 CONV_2D	       15	    36.182	    80.306%	    80.306%	     0.000	       15
	       DEPTHWISE_CONV_2D	       13	     8.763	    19.450%	    99.756%	     0.000	       13
	                 SOFTMAX	        1	     0.051	     0.113%	    99.869%	     0.000	        1
	         AVERAGE_POOL_2D	        1	     0.051	     0.113%	    99.982%	     0.000	        1
	                 RESHAPE	        1	     0.008	     0.018%	   100.000%	     0.000	        1

Timings (microseconds): count=50 first=44865 curr=43927 min=43359 max=57839 avg=45071.9 std=3241
Memory (bytes): count=0
31 nodes observed
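
The "Summary by node type" table is whitespace-separated, so it can be post-processed with a few lines of script, for example to check that the per-type totals agree with the overall figures. The following Python sketch is an assumption-laden illustration: the function name parse_summary_rows is hypothetical, and the parser relies on each data row having exactly seven fields, as in the output above.

```python
# Rows copied from the "Summary by node type" table of the regular benchmark runs.
SUMMARY_ROWS = """\
CONV_2D            15  36.182  80.306%   80.306%  0.000  15
DEPTHWISE_CONV_2D  13   8.763  19.450%   99.756%  0.000  13
SOFTMAX             1   0.051   0.113%   99.869%  0.000   1
AVERAGE_POOL_2D     1   0.051   0.113%   99.982%  0.000   1
RESHAPE             1   0.008   0.018%  100.000%  0.000   1
"""


def parse_summary_rows(text):
    """Return {node_type: (count, avg_ms, avg_percent)} for each data row."""
    rows = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 7:
            continue  # skip blank lines, separators, and the bracketed header
        node_type, count, avg_ms, avg_pct = fields[:4]
        rows[node_type] = (int(count), float(avg_ms), float(avg_pct.rstrip("%")))
    return rows


rows = parse_summary_rows(SUMMARY_ROWS)
total_nodes = sum(count for count, _, _ in rows.values())
total_ms = sum(avg_ms for _, avg_ms, _ in rows.values())
print(f"{total_nodes} nodes, {total_ms:.3f} ms per inference")
```

The node counts add up to the 31 nodes reported, and the summed average time (about 45.06 ms) is consistent with the 45071.9 us average in the Timings line. The two convolution types account for more than 99% of the inference time.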

3 References