How to measure performance of your NN models using TensorFlow Lite runtime

Applicable for

STM32MP13x lines, STM32MP15x lines, STM32MP25x lines

This article describes how to measure the performance of a TensorFlow Lite neural network model on STM32MPUs platforms.

1. Installation[edit source]

1.1. Installing from the OpenSTLinux AI package repository[edit source]

Warning

The software package is provided AS IS, and by downloading it, you agree to be bound to the terms of the software license agreement (SLA0048). The detailed content licenses can be found here.

After having configured the AI OpenSTLinux package install X-LINUX-AI components for this application. The minimum package required is:

 x-linux-ai -i tensorflow-lite-tools

The model used in this example can be installed from the following package:

 x-linux-ai -i img-models-mobilenetv2-10-224

2. How to use the Benchmark application[edit source]

2.1. Executing with the command line[edit source]

The benchmark_model C/C++ application is located in the userfs partition:

/usr/local/bin/tensorflow-lite-*/tools/benchmark_model

It accepts the following input parameters:

usage: ./benchmark_model <flags>
Flags:
        --num_runs=50                           int32   optional        expected number of runs, see also min_secs, max_secs
        --min_secs=1                            float   optional        minimum number of seconds to rerun for, potentially s
        --max_secs=150                          float   optional        maximum number of seconds to rerun for, potentially .
        --run_delay=-1                          float   optional        delay between runs in seconds
        --run_frequency=-1                      float   optional        Execute at a fixed frequency, instead of a fixed del.
        --num_threads=-1                        int32   optional        number of threads
        --use_caching=false                     bool    optional        Enable caching of prepacked weights matrices in matr.
        --benchmark_name=                       string  optional        benchmark name
        --output_prefix=                        string  optional        benchmark output prefix
        --warmup_runs=1                         int32   optional        minimum number of runs performed on initialization, s
        --warmup_min_secs=0.5                   float   optional        minimum number of seconds to rerun for, potentially s
        --verbose=false                         bool    optional        Whether to log parameters whose values are not set. .
        --dry_run=false                         bool    optional        Whether to run the tool just with simply loading the.
        --report_peak_memory_footprint=false    bool    optional        Report the peak memory footprint by periodically che.
        --memory_footprint_check_interval_ms=50 int32   optional        The interval in millisecond between two consecutive .
        --graph=                                string  optional        graph file name
        --input_layer=                          string  optional        input layer names
        --input_layer_shape=                    string  optional        input layer shape
        --input_layer_value_range=              string  optional        A map-like string representing value range for *inte4
        --input_layer_value_files=              string  optional        A map-like string representing value file. Each item.
        --allow_fp16=false                      bool    optional        allow fp16
        --require_full_delegation=false         bool    optional        require delegate to run the entire graph
        --enable_op_profiling=false             bool    optional        enable op profiling
        --max_profiling_buffer_entries=1024     int32   optional        max profiling buffer entries
        --profiling_output_csv_file=            string  optional        File path to export profile data as CSV, if not set .
        --print_preinvoke_state=false           bool    optional        print out the interpreter internals just before call.
        --print_postinvoke_state=false          bool    optional        print out the interpreter internals just before benc.
        --release_dynamic_tensors=false         bool    optional        Ensure dynamic tensor's memory is released when they.
        --help=false                            bool    optional        Print out all supported flags if true.
        --num_threads=-1                        int32   optional        number of threads used for inference on CPU.
        --max_delegated_partitions=0            int32   optional        Max number of partitions to be delegated.
        --min_nodes_per_partition=0             int32   optional        The minimal number of TFLite graph nodes of a partit.
        --delegate_serialize_dir=               string  optional        Directory to be used by delegates for serializing an.
        --delegate_serialize_token=             string  optional        Model-specific token acting as a namespace for deleg.
        --external_delegate_path=               string  optional        The library path for the underlying external.
        --external_delegate_options=            string  optional        A list of comma-separated options to be passed to th.

2.2. Testing with MobileNet[edit source]

The model used for testing is the mobilenet_v2_1.0_224_int8_per_tensor.tflite downloaded from STM32 AI model zoo^[1]. It is a model used for image classification.
On the target, the model is located here:

/usr/local/x-linux-ai/image-classification/models/mobilenet/

2.2.1. Benchmark on CPU[edit source]

The easiest way to use the benchmark is to run it on the CPU. Please, expand the following section to learn how to use it.

To do this you need to run at least the benchmark with the --graph option. But to go a little further, it can be interesting to add the number of CPU cores as an option to the benchmark to improve the performances. Here is the command to execute:

 /usr/local/bin/tensorflow-lite-*/tools/benchmark_model --graph=/usr/local/x-linux-ai/image-classification/models/mobilenet/mobilenet_v2_1.0_224_int8_per_tensor.tflite  --num_threads=2

Console output:

STARTING!
Log parameter values verbosely: [0]
Num threads: [2]
Graph: [/usr/local/x-linux-ai/image-classification/models/mobilenet/mobilenet_v2_1.0_224_int8_per_tensor.tflite]
#threads used for CPU inference: [2]
Loaded model /usr/local/x-linux-ai/image-classification/models/mobilenet/mobilenet_v2_1.0_224_int8_per_tensor.tflite
INFO: Created TensorFlow Lite XNNPACK delegate for CPU.
The input model file size (MB): 3.59541
Initialized session in 273.952ms.
Running benchmark for at least 1 iterations and at least 0.5 seconds but terminate if exceeding 150 seconds.
count=5 first=133187 curr=119112 min=119112 max=133187 avg=122056 std=5566

Running benchmark for at least 50 iterations and at least 1 seconds but terminate if exceeding 150 seconds.
count=50 first=120156 curr=119232 min=119081 max=128264 avg=119760 std=1422

Inference timings in us: Init: 273952, First inference: 133187, Warmup (avg): 122056, Inference (avg): 119760
Note: as the benchmark tool itself affects memory footprint, the following is only APPROXIMATE to the actual memory footprint of the model at runtime. Take the information at your discretion.
Memory footprint delta from the start of the tool (MB): init=13.4102 overall=19.6641

3. References[edit source]

↑ STM32 AI model zoo

[model_zoo_url-1] STM32 AI model zoo

[1]