STM32Cube.AI model performances

Revision as of 11:43, 8 September 2021 by Registered User

This article provides benchmark results for a set of well-known or reference pre-trained neural network models. Some STM32 results will be officially submitted to the MLPerf™ Tiny benchmark from MLCommons™.

Information
  • STM32Cube.AI is a software tool that generates optimized C code for neural network inference on STM32. It is delivered under the Mix Ultimate Liberty+OSS+3rd-party V1 software license agreement (SLA0048).
  • The process used to measure inference time, current and energy is described in this article. The measurements were not made in a certified laboratory, but any user can reproduce them. The results are average values and vary with the input data (random data are currently used), the temperature and the STM32 device itself.
  • The data published in this article are not contractual.

1. Benchmark Results

1.1. STM32 High Performance

| STM32 / Board | STM32 characteristics | Model Source/Link | Flash (KiB) | RAM (KiB) | Proc Time (ms) | Cur. (mA) | Energy (mJ) at 3.3 V | Version |
|---|---|---|---|---|---|---|---|---|
| STM32H723 / NUCLEO-H723ZG | Flash 1MB, RAM 564KB (432), Freq 550MHz | MobileNet v1 0.25 128 quant tfl / source | 468 | 66 | 49 | 203 | 33 | Cube AI 7.0.0, Cube IDE 1.7.0 |
| STM32H723 / NUCLEO-H723ZG | Flash 1MB, RAM 564KB (432), Freq 550MHz | FoodReco quant h5 deriv MobileNet / FP-AI-VISION1 | 132 | 148 | 51 | 197 | 33 | Cube AI 7.0.0, Cube IDE 1.7.0 |
| STM32H723 / NUCLEO-H723ZG | Flash 1MB, RAM 564KB (432), Freq 550MHz | Person Presence MobileNet v2 128 / FP-AI-VISION1 | 403 | 197 | 93 | 200 | 62 | Cube AI 7.0.0, Cube IDE 1.7.0 |
| STM32H723 / NUCLEO-H723ZG | Flash 1MB, RAM 564KB (432), Freq 550MHz | Anomaly Detection v0.5 tfl / MLPerf™ Tiny | 265 | 0.75 | 1.2 | 176 | 0.7 | Cube AI 7.0.0, Cube IDE 1.7.0 |
| STM32H723 / NUCLEO-H723ZG | Flash 1MB, RAM 564KB (432), Freq 550MHz | Key Word Spotting v0.5 tfl / MLPerf™ Tiny | 24 | 18 | 11.5 | 190 | 7 | Cube AI 7.0.0, Cube IDE 1.7.0 |
| STM32H723 / NUCLEO-H723ZG | Flash 1MB, RAM 564KB (432), Freq 550MHz | Image Classif. v0.5 tfl / MLPerf™ Tiny | 77 | 49 | 37 | 190 | 23 | Cube AI 7.0.0, Cube IDE 1.7.0 |
| STM32H723 / NUCLEO-H723ZG | Flash 1MB, RAM 564KB (432), Freq 550MHz | Visual Wake Word v0.5 tfl / MLPerf™ Tiny | 214 | 37 | 31 | 198 | 20 | Cube AI 7.0.0, Cube IDE 1.7.0 |
| STM32H743 / NUCLEO-H743ZI | Flash 2MB, RAM 1MB (512), Freq 480MHz | MobileNet v1 0.25 128 quant tfl / source | 468 | 66 | 49 | 203 | 33 | Cube AI 7.0.0, Cube IDE 1.7.0 |
| STM32H743 / NUCLEO-H743ZI | Flash 2MB, RAM 1MB (512), Freq 480MHz | FoodReco quant h5 deriv MobileNet / FP-AI-VISION1 | 132 | 148 | 59 | 192 | 37 | Cube AI 7.0.0, Cube IDE 1.7.0 |
| STM32H743 / NUCLEO-H743ZI | Flash 2MB, RAM 1MB (512), Freq 480MHz | Person Presence MobileNet v2 128 / FP-AI-VISION1 | 403 | 197 | 108 | 194 | 69 | Cube AI 7.0.0, Cube IDE 1.7.0 |
| STM32H743 / NUCLEO-H743ZI | Flash 2MB, RAM 1MB (512), Freq 480MHz | Anomaly Detection v0.5 tfl / MLPerf™ Tiny | 265 | 0.75 | 1.4 | 168 | 0.78 | Cube AI 7.0.0, Cube IDE 1.7.0 |
| STM32H743 / NUCLEO-H743ZI | Flash 2MB, RAM 1MB (512), Freq 480MHz | Key Word Spotting v0.5 tfl / MLPerf™ Tiny | 24 | 18 | 13 | 196 | 8.7 | Cube AI 7.0.0, Cube IDE 1.7.0 |
| STM32H743 / NUCLEO-H743ZI | Flash 2MB, RAM 1MB (512), Freq 480MHz | Image Classif. v0.5 tfl / MLPerf™ Tiny | 77 | 49 | 42 | 183 | 25.6 | Cube AI 7.0.0, Cube IDE 1.7.0 |
| STM32H743 / NUCLEO-H743ZI | Flash 2MB, RAM 1MB (512), Freq 480MHz | Visual Wake Word v0.5 tfl / MLPerf™ Tiny | 214 | 37 | 36 | 189 | 22 | Cube AI 7.0.0, Cube IDE 1.7.0 |
| STM32H747 SMPS / STM32H747I-DISCO | Cortex® M7, Flash 2MB, RAM 1MB (0.5), Freq 400MHz (1) | MobileNet v1 0.25 128 quant tfl / source | 468 | 66 | 68 | 68 | 15 | Cube AI 7.0.0, Cube IDE 1.7.0 |
| STM32H747 SMPS / STM32H747I-DISCO | Cortex® M7, Flash 2MB, RAM 1MB (0.5), Freq 400MHz (1) | FoodReco quant h5 deriv MobileNet / FP-AI-VISION1 | 132 | 148 | 70.5 | 69.5 | 16 | Cube AI 7.0.0, Cube IDE 1.7.0 |
| STM32H747 SMPS / STM32H747I-DISCO | Cortex® M7, Flash 2MB, RAM 1MB (0.5), Freq 400MHz (1) | Person Presence MobileNet v2 128 / FP-AI-VISION1 | 403 | 197 | 130 | 69.5 | 30 | Cube AI 7.0.0, Cube IDE 1.7.0 |
| STM32H747 SMPS / STM32H747I-DISCO | Cortex® M7, Flash 2MB, RAM 1MB (0.5), Freq 400MHz (1) | Anomaly Detection v0.5 tfl / MLPerf™ Tiny | 265 | 0.75 | 1.6 | 64 | 0.34 | Cube AI 7.0.0, Cube IDE 1.7.0 |
| STM32H747 SMPS / STM32H747I-DISCO | Cortex® M7, Flash 2MB, RAM 1MB (0.5), Freq 400MHz (1) | Key Word Spotting v0.5 tfl / MLPerf™ Tiny | 24 | 18 | 16 | 70 | 3.7 | Cube AI 7.0.0, Cube IDE 1.7.0 |
| STM32H747 SMPS / STM32H747I-DISCO | Cortex® M7, Flash 2MB, RAM 1MB (0.5), Freq 400MHz (1) | Image Classif. v0.5 tfl / MLPerf™ Tiny | 77 | 49 | 51 | 66 | 11 | Cube AI 7.0.0, Cube IDE 1.7.0 |
| STM32H747 SMPS / STM32H747I-DISCO | Cortex® M7, Flash 2MB, RAM 1MB (0.5), Freq 400MHz (1) | Visual Wake Word v0.5 tfl / MLPerf™ Tiny | 214 | 37 | 43 | 68.5 | 9.7 | Cube AI 7.0.0, Cube IDE 1.7.0 |
| STM32H7A3 / NUCLEO-H7A3ZI-Q | Flash 2MB, RAM 1.4MB (1.18), Freq 280MHz | MobileNet v1 0.25 128 quant tfl / source | 468 | 66 | 96 | 44 | 14 | Cube AI 7.0.0, Cube IDE 1.7.0 |
| STM32H7A3 / NUCLEO-H7A3ZI-Q | Flash 2MB, RAM 1.4MB (1.18), Freq 280MHz | FoodReco quant h5 deriv MobileNet / FP-AI-VISION1 | 132 | 148 | 100 | 43.5 | 14 | Cube AI 7.0.0, Cube IDE 1.7.0 |
| STM32H7A3 / NUCLEO-H7A3ZI-Q | Flash 2MB, RAM 1.4MB (1.18), Freq 280MHz | Person Presence MobileNet v2 128 / FP-AI-VISION1 | 403 | 197 | 184 | 44 | 26 | Cube AI 7.0.0, Cube IDE 1.7.0 |
| STM32H7A3 / NUCLEO-H7A3ZI-Q | Flash 2MB, RAM 1.4MB (1.18), Freq 280MHz | Anomaly Detection v0.5 tfl / MLPerf™ Tiny | 265 | 0.75 | 2.3 | 40 | 0.3 | Cube AI 7.0.0, Cube IDE 1.7.0 |
| STM32H7A3 / NUCLEO-H7A3ZI-Q | Flash 2MB, RAM 1.4MB (1.18), Freq 280MHz | Key Word Spotting v0.5 tfl / MLPerf™ Tiny | 24 | 18 | 23 | 44.5 | 3.3 | Cube AI 7.0.0, Cube IDE 1.7.0 |
| STM32H7A3 / NUCLEO-H7A3ZI-Q | Flash 2MB, RAM 1.4MB (1.18), Freq 280MHz | Image Classif. v0.5 tfl / MLPerf™ Tiny | 77 | 49 | 72 | 42 | 10 | Cube AI 7.0.0, Cube IDE 1.7.0 |
| STM32H7A3 / NUCLEO-H7A3ZI-Q | Flash 2MB, RAM 1.4MB (1.18), Freq 280MHz | Visual Wake Word v0.5 tfl / MLPerf™ Tiny | 214 | 37 | 61 | 43 | 9 | Cube AI 7.0.0, Cube IDE 1.7.0 |

(1) On the Cortex® M7 core in SMPS mode, the frequency is 400 MHz instead of the 480 MHz maximum available in LDO mode.

For a given STM32 in a fixed configuration, the current consumption stays in the same range regardless of the model, although it may vary with the complexity and topology of the model. The following table gives the average current consumption of the models listed in the table above (excluding the Anomaly Detection model, which has a specific topology). These data can be used as a first estimate of the current and energy consumption of a new model from the measurement of its inference time alone.

| STM32 Board | STM32H723 550 MHz | STM32H743 480 MHz | STM32H747 400 MHz SMPS | STM32H7A3 280 MHz |
|---|---|---|---|---|
| Average current (mA) | 200 | 180 | 100 | 50 |
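
The estimate described above can be sketched as follows. This is a minimal illustration, assuming that the energy per inference is simply E = V × I × t at the 3.3 V supply used in the tables; the function name and the dictionary keys are hypothetical and are not part of X-CUBE-AI. The first call checks the assumption against one measured row (STM32H723, MobileNet v1 0.25: 49 ms at 203 mA, reported as 33 mJ); the second applies the average current to a hypothetical new model.

```python
# Rough energy estimate for a new model from its inference time alone.
SUPPLY_VOLTAGE_V = 3.3  # supply voltage stated in the Energy column header

# Average current (mA) per configuration, copied from the table above.
AVERAGE_CURRENT_MA = {
    "STM32H723 550 MHz": 200,
    "STM32H743 480 MHz": 180,
    "STM32H747 400 MHz SMPS": 100,
    "STM32H7A3 280 MHz": 50,
}


def estimate_energy_mj(inference_time_ms, current_ma, voltage_v=SUPPLY_VOLTAGE_V):
    """Return the energy per inference in mJ, assuming E = V * I * t."""
    # V * mA * ms gives microjoules; divide by 1000 to obtain millijoules.
    return voltage_v * current_ma * inference_time_ms / 1000.0


# Consistency check against a measured row of the benchmark table.
print(f"{estimate_energy_mj(49, 203):.1f} mJ")

# Hypothetical new model that runs in 20 ms on the STM32H723 at 550 MHz.
new_model_ms = 20  # assumed value, for illustration only
print(f"{estimate_energy_mj(new_model_ms, AVERAGE_CURRENT_MA['STM32H723 550 MHz']):.1f} mJ")
```

Because the average current is only a first approximation, the result is best read as an order of magnitude until an actual power measurement is made.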

1.2. STM32 Ultra Low Power

| STM32 / Board | STM32 characteristics | Model Source/Link | Flash Wgt. | RAM Buf. | Proc Time | Cur. (mA) | Energy (mJ) at 3.3 V | Version |
|---|---|---|---|---|---|---|---|---|
| STM32U585 / B-U585I-IOT02A | Flash 2MB, RAM 786KB, Freq 160MHz | MobileNet v1 0.25 128 quant tfl / source | KB | KB | ms | NA | NA | Cube AI 7.0.0, Cube IDE 1.7.0 |
| STM32L4R5 / NUCLEO-L4R5ZI | Flash 2MB, RAM 640KB, Freq 120MHz | MobileNet v1 0.25 128 quant tfl / source | KB | KB | ms | NA | NA | Cube AI v7.0.0, Cube IDE v1.7.0 |

The following table gives the average current consumption of the models listed in the table above (excluding the Anomaly Detection model, which has a specific topology). These data can be used as a first estimate of the current and energy consumption of a new model from the measurement of its inference time alone.

| STM32 Board | STM32U585 160 MHz | STM32L4R5 120 MHz |
|---|---|---|
| Average current (mA) | 20 | 30 |

2. Measure process

This benchmark reports only the inference processing of the machine learning model. In a complete application, sensor acquisition, data conditioning and pre-processing must also be taken into account.

The STM32 characteristics column gives the available internal Flash size, the full internal RAM size and the frequency. The RAM size covers the different kinds of memories and banks (TCM, SRAM, and so on). For the time being, the buffers used by X-CUBE-AI must be placed in a contiguous memory area; the maximum RAM size available in a contiguous area is given in parentheses when it differs from the full size. The frequency indicated is the operating frequency used for the benchmark, which is generally the maximum frequency. The only exception is the STM32H747 Discovery kit, which operates by default in SMPS power mode and is therefore limited to 400 MHz instead of 480 MHz. Data are rounded to 3 decimals.
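
The contiguous-area constraint can be illustrated with a small sketch. It is a simplification under stated assumptions: it compares only the weights and buffer footprints reported in the table with the device figures, ignores application code, stack and heap, and treats 1MB of Flash as 1024 KiB. The function and constant names are hypothetical.

```python
def fits_on_device(flash_kib, ram_kib, device_flash_kib, contiguous_ram_kib):
    """Return True if the weights fit in Flash and the buffers fit in one contiguous RAM area."""
    return flash_kib <= device_flash_kib and ram_kib <= contiguous_ram_kib


# STM32H723 figures from the table: Flash 1MB, RAM 564KB of which 432 are contiguous.
H723_FLASH_KIB = 1024  # assumption: 1MB is treated as 1024 KiB
H723_TOTAL_RAM_KIB = 564
H723_CONTIGUOUS_RAM_KIB = 432

# Person Presence MobileNet v2 128 row: 403 KiB of Flash, 197 KiB of RAM.
print(fits_on_device(403, 197, H723_FLASH_KIB, H723_CONTIGUOUS_RAM_KIB))  # True

# Hypothetical model needing 500 KiB of buffers: below the total RAM size,
# but above the largest contiguous area, so it does not fit.
print(fits_on_device(300, 500, H723_FLASH_KIB, H723_CONTIGUOUS_RAM_KIB))  # False
```

The second case shows why the figure in parentheses, not the full RAM size, is the one to compare against the RAM footprint.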

The memory footprints are those reported by the X-CUBE-AI "Analyze" function (the X-CUBE-AI version used is given in the table).

The Model Source/Link column indicates the pre-trained ML model and its source: either how it was built and trained, or where it can be downloaded. tfl stands for a TensorFlow™ Lite .tflite model, h5 for a Keras .h5 model, and quant for a model quantized on 8 bits. The FP-AI-VISION1 models are located in the package directory FP-AI-VISION1_V3.0.0\Utilities\AI_resources.

The Flash Wgt. column reports the Flash space occupied by the model weights.

The RAM Buf. column reports the RAM occupied by the buffers that store the model activations, together with the input and output buffers. To save RAM, the "Use activation buffer for input buffer" and "Use activation buffer for the output buffer" options are selected (in the X-CUBE-AI Advanced Settings panel).

The Proc Time column reports the model inference processing time. When the current and energy are indicated, the measurement is made with the X-CUBE-AI "System Performance" application, following the process described in the wiki article on power measurement. Otherwise, the "Validation on target" application is used. In all cases, HSI is the clock source selected when generating the application: X-CUBE-AI first generates the optimal clock settings, the clock source is then set to HSI, and STM32CubeMX automatically reconfigures the clock settings.

Cur. and Energy are the current and energy computed following the process described in the wiki article on power measurement. For STM32 Ultra Low Power microcontrollers, measurements are made with the X-NUCLEO-LPM01A power shield as described in section 4.3.1 "Measure process when current is below 50 mA". For STM32 High Performance microcontrollers, measurements are made with the Qoitech Otii Arc power analyzer as described in section 4.3.2 "Measure process when current is above 50 mA". In both cases, a 10 s window is used for averaging and HSI is selected as the clock source.
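
The 10 s averaging step can be sketched as follows. This is only an illustration of a windowed mean: it assumes the current samples are exported as (timestamp in seconds, current in mA) pairs at a uniform rate, which is an assumption about the data format and not a description of the power shield or power analyzer software. The function name and the sample values are hypothetical.

```python
def mean_current_ma(samples, window_s=10.0):
    """Average the current over the first window_s seconds of a capture.

    samples: list of (timestamp_s, current_ma) pairs in ascending time order,
    assumed to be uniformly spaced so that a plain mean is adequate.
    """
    if not samples:
        raise ValueError("no samples to average")
    start = samples[0][0]
    # Keep only the samples that fall inside the averaging window.
    in_window = [current for t, current in samples if t - start < window_s]
    return sum(in_window) / len(in_window)


# Synthetic 12 s capture at one sample per second, alternating 200 mA and 206 mA.
capture = [(t, 200.0 if t % 2 == 0 else 206.0) for t in range(12)]
print(mean_current_ma(capture))  # 203.0, from the ten samples at t = 0 .. 9 s
```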

Accuracy is not reported. X-CUBE-AI does not modify the DL/ML model topology, so the impact on accuracy should be limited. Through the "Validation" application, X-CUBE-AI provides a way to measure the accuracy either on x86 or on the target, which can be used to check any impact on accuracy. When running the "Validation on target" application, several metrics are computed; one of them is the X-Cross, which provides error metrics between the original model executed in Python and the C model executed on the target. Random data can be used to compute the RMSE/MAE/L2R errors, but real data are recommended to obtain the final accuracy. For more details on the metrics, refer to the X-CUBE-AI Embedded Documentation.

Note that checking the accuracy is important when comparing a float model to a quantized model, or when using the weight compression feature of X-CUBE-AI for float models.
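
For readers unfamiliar with the error metrics named above, the sketch below computes them between two output vectors. RMSE and MAE follow their standard definitions; L2R is taken here to mean the L2 norm of the error divided by the L2 norm of the reference, which is an assumption, so the X-CUBE-AI Embedded Documentation remains the reference for the exact formulas. The function names and sample values are hypothetical and this is not the X-CUBE-AI implementation.

```python
import math


def rmse(ref, pred):
    """Root mean square error between reference and predicted outputs."""
    return math.sqrt(sum((r - p) ** 2 for r, p in zip(ref, pred)) / len(ref))


def mae(ref, pred):
    """Mean absolute error between reference and predicted outputs."""
    return sum(abs(r - p) for r, p in zip(ref, pred)) / len(ref)


def l2r(ref, pred):
    """Relative L2 error: ||ref - pred|| / ||ref|| (assumed definition)."""
    error_norm = math.sqrt(sum((r - p) ** 2 for r, p in zip(ref, pred)))
    ref_norm = math.sqrt(sum(r ** 2 for r in ref))
    return error_norm / ref_norm


# Outputs of the original model (reference) and of the generated C model.
reference_outputs = [0.0, 3.0, 4.0]  # illustrative values only
c_model_outputs = [0.0, 3.0, 4.5]

print(f"RMSE={rmse(reference_outputs, c_model_outputs):.4f}")
print(f"MAE={mae(reference_outputs, c_model_outputs):.4f}")
print(f"L2R={l2r(reference_outputs, c_model_outputs):.4f}")
```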


