How to automate code generation and validation with X-CUBE-AI CLI

This article describes how to automate the generation and validation of machine learning model code with the X-CUBE-AI command-line interface (CLI). The example is provided for Windows® using a batch script. It can easily be adapted to another operating system or to Python™. More information on the CLI can be found in the embedded documentation.

Information
  • X-CUBE-AI is a software package that generates optimized C code for neural network inference on STM32 microcontrollers. It is delivered under the Mix Ultimate Liberty+OSS+3rd-party V1 software license agreement (SLA0048).

1 Requirements and installations

  • X-CUBE-AI latest version (latest tested v7.2)
  • STM32CubeIDE latest version (latest tested v1.10.1)
  • STM32CubeProgrammer (STM32CubeProg) latest version (latest tested v2.11.0)
  • An STM32 evaluation board. For this example, the NUCLEO-H723ZG Nucleo board is used.
  • A model (Keras .h5, TensorFlow™ Lite .tflite, or ONNX). For this example, a quantized MobileNet v1 0.25 model with 128x128x3 image inputs is used.

The model, mobilenet_v1_0.25_128_quantized.tflite, is available for download.

Once X-CUBE-AI and STM32CubeIDE are installed, note the installation paths, which are usually the default ones on Windows® (replace "username" with your Windows® user account name and adapt the paths to your tool versions):

  • For X-CUBE-AI: C:\Users\username\STM32Cube\Repository\Packs\STMicroelectronics\X-CUBE-AI\7.2.0
  • For STM32CubeIDE: C:\ST\STM32CubeIDE_1.10.1\STM32CubeIDE\
  • For arm gcc installed through STM32CubeIDE: C:\ST\STM32CubeIDE_1.10.1\STM32CubeIDE\plugins\\tools\bin
  • For STM32CubeProgrammer: C:\Program Files\STMicroelectronics\STM32Cube\STM32CubeProgrammer\bin
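Since the flow can also be driven from Python, the default locations above can be gathered in one place. The following is a minimal sketch only: the helper name default_tool_paths and its dictionary keys are hypothetical, the version numbers are the ones tested in this article, and the Arm GCC path is left out because it depends on the installed plugin version.

```python
from pathlib import PureWindowsPath


def default_tool_paths(username, xcubeai_version="7.2.0", cubeide_version="1.10.1"):
    """Return the default Windows install locations listed above (hypothetical helper)."""
    repository = PureWindowsPath("C:/Users") / username / "STM32Cube" / "Repository"
    return {
        # X-CUBE-AI pack inside the STM32Cube repository
        "x_cube_ai": repository / "Packs" / "STMicroelectronics" / "X-CUBE-AI" / xcubeai_version,
        # STM32CubeIDE installation directory
        "cube_ide": PureWindowsPath("C:/ST") / f"STM32CubeIDE_{cubeide_version}" / "STM32CubeIDE",
        # STM32CubeProgrammer command-line tools
        "cube_prog": PureWindowsPath(
            "C:/Program Files/STMicroelectronics/STM32Cube/STM32CubeProgrammer/bin"
        ),
    }


if __name__ == "__main__":
    for tool, path in default_tool_paths("username").items():
        print(f"{tool}: {path}")
```

PureWindowsPath is used so that the paths are formatted with Windows separators even when the sketch is run on another operating system.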

2 Validation application generation

STM32CubeMX with the X-CUBE-AI plugin is used to generate the initial validation project for the targeted board.

2.1 Create a new project

Open STM32CubeMX and start the project using the board selector:

STM32CubeMX start project

Select the board to use, NUCLEO-H723ZG in the example, and create a project without initializing all peripherals with their default mode:

STM32CubeMX initialization

2.2 Add X-CUBE-AI software pack

Select X-CUBE-AI software pack "Core" and "Validation" application:

STM32CubeAI software pack

Click the X-CUBE-AI software pack:

STM32CubeAI software pack

If the default peripheral parameters are not set for the best performance, the system warns you. Select "Yes" to make sure that the maximum frequency is used.

STM32CubeAI best performance

X-CUBE-AI configures the default parameters for the best performance. It also configures the UART used to report performance.

2.3 Generate the validation application

Upload a model, such as mobilenet_v1_0.25_128_quantized.tflite in the example.

STM32CubeAI load model

In the example, the default memory options of X-CUBE-AI v7.2 are kept: the input and output buffers are allocated in the activations buffer.

STM32CubeAI options

You can then generate the corresponding validation application as an STM32CubeIDE project in a dedicated directory, and then open it.

STM32CubeAI generate validation

In STM32CubeIDE, build the project.

build validation

To check that the validation application works as expected, first program the board (either through STM32CubeIDE or STM32CubeProgrammer), then reset the board. Open the validation on target panel and either select the board COM port or keep the automatic detection. Do not select the automatic compilation and download option, as the board is already programmed with the right firmware.

STM32CubeAI validation on target

Press "OK" and check that the validation runs correctly.

STM32CubeAI validation done

3 Automatic generation

The batch script below uses ST tools through the command-line interface to automatically:

  • get the full memory footprint of a given model
  • get the inference time
  • get the validation metrics (validation done on random data)

Important

The script might need to be adapted depending on the targeted configuration, for instance if the input model is a Keras quantized model, or if specific options are selected such as the use of external memory. Refer to the command-line interface section of the embedded documentation.

In the script below, the local working directory is "C:\Project", which must correspond to the directory of the initial validation application project. There are two modes, defined by the variable "mode".

  • The "analyze" mode, mode=analyze, only runs an analysis to get the main memory footprint contributors (such as the weights and activations buffers). In this mode, there is no need to generate a validation application (section 2 can be skipped). Only X-CUBE-AI must be installed.
  • The "validate" mode, mode=validate (default), runs a validation to get the full memory footprint (weights, activations, and runtime), as well as the inference time and validation metrics.
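The two modes map to different sequences of stm32ai invocations. As a rough Python sketch of that mapping (the function name stm32ai_commands and its parameters are hypothetical; the sub-commands and flags are only the ones used in the batch script of this article):

```python
def stm32ai_commands(mode, model, workspace, output, series, lib_dir):
    """Return the stm32ai argument lists used for a given script mode."""
    common = ["--allocate-inputs", "--allocate-outputs"]
    dirs = ["-w", workspace, "-o", output]
    if mode == "analyze":
        # Analysis only: weights and activations footprint, no board needed.
        return [["stm32ai", "analyze", "-m", model, *common, *dirs]]
    if mode == "validate":
        return [
            # 1. Relocatable generation, used to report the runtime footprint.
            ["stm32ai", "generate", "-m", model, "--relocatable", *common,
             "--series", series, *dirs],
            # 2. Regular generation; the network files then replace those
            #    of the validation project, which is rebuilt and flashed.
            ["stm32ai", "generate", "-m", model, *common, "--lib", lib_dir, *dirs],
            # 3. Validation on the target with random data.
            ["stm32ai", "validate", "-m", model, *common, "--mode", "stm32", *dirs],
        ]
    raise ValueError(f"unknown mode: {mode!r} (expected 'analyze' or 'validate')")


if __name__ == "__main__":
    for cmd in stm32ai_commands("validate", "model.tflite", "ws", "out", "stm32h7", "lib"):
        print(" ".join(cmd))
```

Returning argument lists rather than running the tool directly keeps the sketch usable as a dry run; each list can be passed to subprocess.run once the tools are installed.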

In the script, a temporary directory "tmp" is created with two subdirectories, "stm32ai_ws" and "stm32ai_output". These directories are used by the X-CUBE-AI stm32ai CLI executable as workspace and output directory. The main outputs and reports are copied from these directories to a result directory, which is "C:\Project\Results\STM32H723\mobilenet_v1_0.25_128_quant.tflite" in the example.
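For a Python adaptation, the same directory layout could be prepared as follows. This is a sketch under assumptions: the helper name prepare_directories is hypothetical, and the directory names are the ones described above.

```python
from pathlib import Path


def prepare_directories(locpath, name, model):
    """Create the workspace, output and result directories (hypothetical helper)."""
    root = Path(locpath)
    workspace = root / "tmp" / "stm32ai_ws"
    output = root / "tmp" / "stm32ai_output"
    # One result directory per target and per model file name.
    result_dir = root / "Results" / name / model
    for directory in (workspace, output, result_dir):
        # exist_ok avoids an error when the script is run a second time.
        directory.mkdir(parents=True, exist_ok=True)
    return workspace, output, result_dir


if __name__ == "__main__":
    ws, out, res = prepare_directories(".", "STM32H723", "mobilenet_v1_0.25_128_quant.tflite")
    print(ws, out, res, sep="\n")
```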

The total memory footprint for a given model is not limited to the size of the model weights (stored in flash memory) and the size of the activations, input, and output buffers. The size of the runtime code (flash memory and RAM) must also be considered.

The runtime memory footprint can only be measured by building the code with a toolchain.

A convenient way to measure the runtime footprint is to generate the code using the relocatable option.

A relocatable binary model designates a binary object that can be installed and executed anywhere in an STM32 memory sub-system. It contains a compiled version of the generated NN C-files, including the requested forward kernel functions and the weights. The principal objective is to provide a flexible way to upgrade an AI-based application without regenerating and programming the whole end-user firmware. This is the primary element to use, for example, the FOTA (Firmware Over-The-Air) technology. Refer to the "Relocatable binary model support" section of the embedded documentation for more information.

The relocatable binary corresponds to the model runtime code and the related report provides all the needed information to get the full memory footprint. Below is a report generated with the relocatable option:

Runtime memory layout (series="stm32h7")
 section      size (bytes)                                        
 header                100*                    
 txt                33,716                      network+kernel    
 rodata                628                      network+kernel    
 data               14,884                      network+kernel    
 bss                   384                      network+kernel    
 got                   568*                    
 rel                 4,172*                    
 weights           478,804                      network           
 FLASH size        528,032 + 4,840* (+0.92%)   
 RAM size**         15,268 + 568* (+3.72%)     
 bin size          532,876                      binary image      
 act. size          67,120                      activations buffer
 (*)  extra bytes for relocatable support
 (**) Full RAM = RAM + act. + IO if not allocated in activations buffer

The fields marked with (*) must be ignored because they relate to code specific to the relocatable feature and not to the runtime code itself. The full flash memory size is composed of the "weights" size (478,804 bytes here) and of the "txt", "rodata", and "data" sections (49,228 bytes), for a total of 528,032 bytes. The full RAM size is composed of the activations (67,120 bytes here), the input and output buffers (if not allocated in the activations buffer), and the "data" and "bss" sections (15,268 bytes), for a total of 82,388 bytes. To save RAM, the "Use activation buffer for input buffer" and "Use activation buffer for the output buffer" options are selected in this example, so the input and output buffers add no memory footprint. Otherwise, their sizes must be added to the total RAM size. An alternative way to measure the runtime footprint is to build the validation application and analyze the memory map.
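The arithmetic above can be reproduced with a few lines of Python. The sketch below simply hard-codes the section sizes from the report shown earlier; the function name footprint and the dictionary layout are illustrative choices, not part of the X-CUBE-AI tools.

```python
# Section sizes in bytes, copied from the relocatable report above.
SECTIONS = {"txt": 33716, "rodata": 628, "data": 14884, "bss": 384}
# Extra bytes for relocatable support, marked (*) in the report and ignored.
RELOC_EXTRA = {"header": 100, "got": 568, "rel": 4172}
WEIGHTS = 478804
ACTIVATIONS = 67120


def footprint(sections, weights, activations, io_bytes=0):
    """Compute the flash and RAM totals from the runtime sections.

    io_bytes is the size of the input/output buffers when they are NOT
    allocated in the activations buffer (0 in this example).
    """
    # "data" is stored in flash and copied to RAM at startup,
    # so it counts on both sides.
    runtime_flash = sections["txt"] + sections["rodata"] + sections["data"]
    runtime_ram = sections["data"] + sections["bss"]
    return {
        "runtime_flash": runtime_flash,
        "flash_total": weights + runtime_flash,
        "runtime_ram": runtime_ram,
        "ram_total": runtime_ram + activations + io_bytes,
    }


if __name__ == "__main__":
    for key, value in footprint(SECTIONS, WEIGHTS, ACTIVATIONS).items():
        print(f"{key}: {value:,} bytes")
    print(f"relocatable extra (ignored): {sum(RELOC_EXTRA.values()):,} bytes")
```

With the values of the report, this gives 49,228 bytes of runtime flash, 528,032 bytes of total flash, 15,268 bytes of runtime RAM, and 82,388 bytes of total RAM, matching the figures in the text.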

To get the inference time, the script generates the code for the given network and replaces the corresponding network files in the initial validation application. The validation application is then rebuilt with the STM32CubeIDE CLI. The generated firmware .elf is loaded on the board, which must be connected to the PC through the ST-LINK USB and identified through a COM port. The board is then reset, and the validation application (with random data) is launched. The validation report provides the model inference time as well as the validation error metrics. The script copies the main information to the result .txt file:

weights (ro)          : 478,804 B (467.58 KiB) (1 segment) / -1,391,568(-74.4%) vs float model
weights size          : 478804 (1 segments)
activations (rw)      : 67,120 B (65.55 KiB) (1 segment) *
activations size      : 67120 (1 segments)
ram (total)           : 67,120 B (65.55 KiB) = 67,120 + 0 + 0
 FLASH size        528,032 + 4,840* (+0.92%)   
 RAM size**         15,268 + 568* (+3.72%)     
 (**) Full RAM = RAM + act. + IO if not allocated in activations buffer
  duration            : 45.774ms
  cycles/MACC         : 1.85
X-cross #1   n.a.   0.000772025   0.000057365   0.052570902   0.000007414   0.000772028   0.997128242   nl_30_0_conversion, ai_u8, (1,1,1,1001), m_id=[30]
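The script fills this result file by filtering the reports with findstr. A Python adaptation could filter the report text in the same way, as in the sketch below. The function names are hypothetical and only plain substring matching is reproduced, not the full findstr syntax.

```python
import re


def findstr(report_text, pattern, at_line_start=False):
    """Return the lines containing pattern (or starting with it, like findstr /b)."""
    lines = report_text.splitlines()
    if at_line_start:
        return [line for line in lines if line.startswith(pattern)]
    return [line for line in lines if pattern in line]


def validation_summary(validate_report):
    """Collect the lines copied from the validation report, one pass per pattern."""
    summary = []
    for pattern in ("duration", "cycles/MACC", "X-cross"):
        summary += findstr(validate_report, pattern)
    return summary


def duration_ms(summary_lines):
    """Extract the inference time in milliseconds, or None if it is absent."""
    for line in summary_lines:
        match = re.search(r"duration\s*:\s*([\d.]+)\s*ms", line)
        if match:
            return float(match.group(1))
    return None


if __name__ == "__main__":
    sample = "  duration            : 45.774ms\n  cycles/MACC         : 1.85\n"
    lines = validation_summary(sample)
    print(lines, duration_ms(lines))
```

Parsing the duration into a number makes it straightforward to compare several models or targets in a later step.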

Automatic code generation and validation with X-CUBE-AI CLI script:

Information

Be aware that the imported model must have the same number of inputs and outputs as the initial model used to generate the STM32 validation project.

This snippet is provided AS IS, and by taking it, you agree to be bound to the license terms that can be found here for the component: Linker Scripts.
@echo off
Rem ******************************************************************************
Rem Copyright (c) 2022 STMicroelectronics. All rights reserved.
Rem ******************************************************************************
Rem This software component is provided to you as part of a software package and
Rem applicable license terms are in the  Package_license file. If you received this
Rem software component outside of a package or without applicable license terms,
Rem the terms of the BSD-3-Clause license shall apply. 
Rem You may obtain a copy of the BSD-3-Clause at:
Rem ******************************************************************************
Rem set the different paths to the tools and project location
Rem replace below username by your Windows user account name
set CUBE_FW_DIR=C:\Users\username\STM32Cube\Repository
set X_CUBE_AI_DIR=%CUBE_FW_DIR%\Packs\STMicroelectronics\X-CUBE-AI\7.2.0
set CUBE_IDE_DIR=C:\ST\STM32CubeIDE_1.10.1\STM32CubeIDE\
set ARM_GCC_DIR=C:\ST\STM32CubeIDE_1.10.1\STM32CubeIDE\plugins\\tools\bin
set CUBE_PROG_DIR=C:\Program Files\STMicroelectronics\STM32Cube\STM32CubeProgrammer\bin
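set locpath=C:\Project
Rem Assumption: the tools are not already on PATH and stm32ai.exe is located in the Utilities\windows folder of the X-CUBE-AI pack
set PATH=%X_CUBE_AI_DIR%\Utilities\windows;%CUBE_IDE_DIR%;%ARM_GCC_DIR%;%CUBE_PROG_DIR%;%PATH%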
Rem name: to identify the target STM32
set name=STM32H723
Rem STM32 series: stm32
set series=stm32h7
Rem model: model complete file name (including extension)
set model=mobilenet_v1_0.25_128_quant.tflite
Rem prjval: name of the initial X-CUBE-AI validation project
set prjval=STM32H723_project
Rem Com port is optional if only one board is connected to the PC
Rem set cport=COM18
Rem mode: analyze => run only an analysis to get the main memory footprint contributors (i.e. weights and activations buffers)
Rem mode: validate => run a validation to get the full memory footprint (weights, activations and runtime) as well as the inference time
set mode=validate
Rem create directory
mkdir %locpath%\Results\%name%
mkdir %locpath%\Results\%name%\%model%
mkdir %locpath%\tmp
mkdir %locpath%\tmp\stm32ai_ws
mkdir %locpath%\tmp\stm32ai_output
set result_dir=%locpath%\Results\%name%\%model%
Rem write all results in a specific text file result.txt
if exist %result_dir%\result.txt del %result_dir%\result.txt
echo %name% >> %result_dir%\result.txt
echo %model% >> %result_dir%\result.txt
if %mode%==analyze (
  Rem run the analysis only
  @echo on
  @echo Analyze %name% %model%
  stm32ai analyze -m %locpath%\%model% --allocate-inputs --allocate-outputs -w %locpath%\tmp\stm32ai_ws -o %locpath%\tmp\stm32ai_output
  copy /Y %locpath%\tmp\stm32ai_ws\network_report.json %result_dir%
  copy /Y %locpath%\tmp\stm32ai_output\network_analyze_report.txt  %result_dir%
  findstr /b "weights" %result_dir%\network_analyze_report.txt >> %result_dir%\result.txt
  findstr /b "activations" %result_dir%\network_analyze_report.txt >> %result_dir%\result.txt
  findstr /b "ram" %result_dir%\network_analyze_report.txt >> %result_dir%\result.txt
  @echo off
)
if %mode%==validate (
  Rem run a full validation to get the full memory footprint (weights, activations and runtime) as well as the inference time
  @echo on
  @echo Get the total memory footprint of the model %model% for %name%
  stm32ai generate -m %locpath%\%model% --relocatable --allocate-inputs --allocate-outputs --series %series% -w %locpath%\tmp\stm32ai_ws -o %locpath%\tmp\stm32ai_output
  copy /Y %locpath%\tmp\stm32ai_ws\network_report.json %result_dir%
  copy /Y %locpath%\tmp\stm32ai_output\network_generate_report.txt  %result_dir%\network_generate_relocatable_report.txt
  findstr /b "weights" %result_dir%\network_generate_relocatable_report.txt >> %result_dir%\result.txt
  findstr /b "activations" %result_dir%\network_generate_relocatable_report.txt >> %result_dir%\result.txt
  findstr /b "ram" %result_dir%\network_generate_relocatable_report.txt >> %result_dir%\result.txt
  findstr "FLASH" %result_dir%\network_generate_relocatable_report.txt >> %result_dir%\result.txt
  findstr "RAM" %result_dir%\network_generate_relocatable_report.txt >> %result_dir%\result.txt
  @echo Get the inference time of the model %model% for %name%
  @echo Generate %name% %model%
  stm32ai generate -m %locpath%\%model% --allocate-inputs --allocate-outputs --lib %X_CUBE_AI_DIR%\Middlewares\ST\AI -w %locpath%\tmp\stm32ai_ws -o %locpath%\tmp\stm32ai_output
  copy /Y %locpath%\tmp\stm32ai_output\network*.c %locpath%\%prjval%\X-CUBE-AI\App\
  copy /Y %locpath%\tmp\stm32ai_output\network*.h %locpath%\%prjval%\X-CUBE-AI\App\
  @echo Building validation project %prjval%.elf
  stm32cubeidec.exe --launcher.suppressErrors -nosplash -application org.eclipse.cdt.managedbuilder.core.headlessbuild -data workspace -import %locpath%\%prjval% -cleanBuild %locpath%\%prjval%\Debug
  @echo Programming the board %locpath%\%prjval%\Debug\%prjval%.elf
  STM32_Programmer_CLI.exe -c port=SWD mode=UR -w %locpath%\%prjval%\Debug\%prjval%.elf
  Rem If the .bin file is preferred, the command-line must be adapted to add the start address
  Rem STM32_Programmer_CLI.exe -c port=SWD mode=UR -w %locpath%\%prjval%\Debug\%prjval%.bin 0x08000000
  STM32_Programmer_CLI.exe -c port=SWD -hardRst
  @echo Launch the validation application
  Rem stm32ai validate -m %locpath%\%model% -q %modelquant% --allocate-inputs --allocate-outputs --mode stm32 -d %cport% -w %locpath%\tmp\stm32ai_ws -o %locpath%\tmp\stm32ai_output
  stm32ai validate -m %locpath%\%model% --allocate-inputs --allocate-outputs --mode stm32 -w %locpath%\tmp\stm32ai_ws -o %locpath%\tmp\stm32ai_output
  copy /Y %locpath%\tmp\stm32ai_output\network_validate_report.txt %result_dir%
  findstr "duration"  %result_dir%\network_validate_report.txt >> %result_dir%\result.txt
  findstr "cycles/MACC" %result_dir%\network_validate_report.txt >> %result_dir%\result.txt
  findstr "X-cross" %result_dir%\network_validate_report.txt >> %result_dir%\result.txt
  @echo off
)
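The build, programming, and reset steps of the batch script can be adapted to Python with subprocess. The following is a sketch under assumptions: the function names are hypothetical, the command lines are copied from the batch script above, and the tools are expected to be reachable through PATH. The dry_run default only prints the commands, so nothing is executed until it is disabled.

```python
import subprocess


def build_flash_commands(project_dir, project_name):
    """Return the build, program and reset commands used by the batch script."""
    debug_dir = f"{project_dir}\\Debug"
    elf = f"{debug_dir}\\{project_name}.elf"
    return [
        # Headless rebuild of the validation project with STM32CubeIDE.
        ["stm32cubeidec.exe", "--launcher.suppressErrors", "-nosplash",
         "-application", "org.eclipse.cdt.managedbuilder.core.headlessbuild",
         "-data", "workspace", "-import", project_dir, "-cleanBuild", debug_dir],
        # Program the generated .elf through SWD.
        ["STM32_Programmer_CLI.exe", "-c", "port=SWD", "mode=UR", "-w", elf],
        # Hardware reset so that the validation firmware starts.
        ["STM32_Programmer_CLI.exe", "-c", "port=SWD", "-hardRst"],
    ]


def run_all(commands, dry_run=True):
    """Print each command and, unless dry_run is set, execute it."""
    for cmd in commands:
        print(" ".join(cmd))
        if not dry_run:
            # check=True stops the sequence on the first failing tool.
            subprocess.run(cmd, check=True)


if __name__ == "__main__":
    run_all(build_flash_commands(r"C:\Project\STM32H723_project", "STM32H723_project"))
```

Stopping on the first failure avoids running a validation against a board that still holds the previous firmware.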

Copyright (c) 2022 STMicroelectronics. All rights reserved.