How to measure machine learning model power consumption with STM32Cube.AI generated application

Revision as of 09:04, 30 August 2021 by Registered User

This article describes how to modify the system performance application generated by STM32Cube.AI so that power and energy measurements can be run in an optimal configuration.

The system performance application automatically runs inferences of a machine learning model (neural network or traditional machine learning model) converted by STM32Cube.AI, and measures the inference time directly on the target. It can also be used to measure power consumption. However, the default settings are not optimal for accurately measuring the processing alone, excluding peripherals and leakage on unused GPIOs. This article uses the NUCLEO-L4R5ZI as an example, but the process can be adapted to any board supported by STM32Cube.AI.

Information
  • STM32Cube.AI is a software tool that generates optimized C code for neural network inference on STM32. It is delivered under the Mix Ultimate Liberty+OSS+3rd-party V1 software license agreement (SLA0048).

1. Prerequisites

1.1. Hardware

1.2. Software

2. Project generation

2.1. Loading a pre-defined ioc

The next section describes how to generate the project starting from STM32CubeMX. Pre-defined STM32CubeMX project files (ioc) for some boards will soon be provided on our GitHub. You can load one of them and then go directly to the Import your model in X-CUBE-AI section. To load an ioc, select File / Load Project:

2.2. Create a new project

Open STM32CubeMX and start project using the board selector:

Select the board to use, in our case the NUCLEO-L4R5ZI and create a project without initializing all peripherals with their default Mode:

2.3. Add X-CUBE-AI software pack

Select X-CUBE-AI software pack Core and System Performance application:

Click on X-CUBE-AI software pack:

If the peripheral parameters are not set for best performance by default, the system warns you. Select Yes to make sure the maximum frequency is used.

X-CUBE-AI will configure default parameters to set the best performance as well as configuring the UART used to report performances.

You can check which UART X-CUBE-AI uses to communicate with the board. It is the UART connected to the embedded STLink, which is seen from the PC as a Virtual Com Port once connected by USB. To do so, open the Platform Settings panel:

For the NUCLEO-L4R5ZI it is the LPUART1. You can also check the settings used by X-CUBE-AI for the specified UART in the Connectivity panel / Parameter Settings:

These settings must be used in the hyper terminal on the PC to communicate with the board when the system performance application is running.

Identify the GPIOs used by the UART in the GPIO Settings panel:

For the NUCLEO-L4R5ZI, these are PG7 and PG8 (pins 7 and 8 of bank G).

2.4. Check / modify clock configuration

You can also open and modify the clock configuration if needed, for instance to select a specific HCLK frequency or to change the clock source (on the STM32L4R5ZI Nucleo, for instance, from HSI to MSI). For convenience when setting the GPIOs, it is recommended to select HSI as the clock source. If HSE (external clock) is selected, make sure not to reset the GPIOs connected to the external crystal, RCC_OSC_IN and RCC_OSC_OUT. Note that on Nucleo boards the HSE crystal is generally not mounted by default. You can also check the clock settings in the System Core / RCC panel, especially the Power Regulator Voltage Scale (see the Important notes section for the STM32H747 case).

When using SMPS for power supply, make sure the right power regulator is selected for the right frequency.

2.5. Reset unnecessary GPIOs

In the “Pinout & Configuration” view, reset all the unused GPIOs. All pins can be put in reset state except the STLINK_RX and STLINK_TX UART pins (PG7 and PG8 for the NUCLEO-L4R5ZI, configured by X-CUBE-AI), NRST, the voltage pins and, only if HSE is selected, RCC_OSC_IN and RCC_OSC_OUT. On the NUCLEO-L4R5ZI example, this means going from the following configuration:

to the following one:

2.6. Import your model in X-CUBE-AI

As usual with X-CUBE-AI, import the model you want to analyze:

To optimize the RAM usage, it is advised to select the "Use activation buffer for input buffer" and "Use activation buffer for the output buffer" options in Advanced Settings panel:

You can also run an Analyze to get the memory footprint of the model:


2.7. System Performance code generation

Open the Project Manager tab and specify a project name. Select the IDE, in this example we will use STM32Cube IDE:

Select the Code Generator tab and check the option “Set all free pins as analog (to optimize the power consumption)”:

3. Project modification

Once generated you can open the project with STM32Cube IDE:

The ioc can be reused to recover the settings in CubeMX and to modify or generate new projects with new models without reconfiguring everything. A set of pre-configured iocs will soon be available on our GitHub. You can do a first build of the project to check the generation.

Add to main.c (located in Core/Src) between the two tags “/* USER CODE BEGIN 0 */” and “/* USER CODE END 0 */” the following functions for the NUCLEO-L4R5ZI:

This snippet is provided AS IS, and by taking it, you agree to be bound to the license terms that can be found here for the component: Application.
/* USER CODE BEGIN 0 */
/**
  * @brief Disable the clock of all GPIOs
  * @param None
  * @retval None
  */
void MX_GPIO_Disable(void)
{
	
  __HAL_RCC_GPIOA_CLK_DISABLE();
  __HAL_RCC_GPIOB_CLK_DISABLE();
  __HAL_RCC_GPIOC_CLK_DISABLE();
  __HAL_RCC_GPIOD_CLK_DISABLE();
  __HAL_RCC_GPIOE_CLK_DISABLE();
  __HAL_RCC_GPIOF_CLK_DISABLE();
  __HAL_RCC_GPIOG_CLK_DISABLE();
  __HAL_RCC_GPIOH_CLK_DISABLE();
}

/**
  * @brief Disable the VCOM UART
  * @param None
  * @retval None
  */
void MX_UARTx_DeInit(void)
{
  HAL_UART_DeInit(&hlpuart1);
  GPIO_InitTypeDef GPIO_InitStruct = {0};

  /*Configure GPIO pins : PG7, PG8 */
  GPIO_InitStruct.Pin = GPIO_PIN_7|GPIO_PIN_8;
  GPIO_InitStruct.Mode = GPIO_MODE_ANALOG;
  GPIO_InitStruct.Pull = GPIO_NOPULL;
  HAL_GPIO_Init(GPIOG, &GPIO_InitStruct);
}
/* USER CODE END 0 */

For boards other than the NUCLEO-L4R5ZI, you need to adapt:

  • The number and names of the GPIO banks. Refer to the function MX_GPIO_Init in main.c, which enables the clock of all the GPIO banks in order to set the pins to analog mode.
  • The UART handle used for the Virtual Com Port. It is also specified in main.c, where the UART is configured and enabled. Refer to the “Private variables” section and the UART_HandleTypeDef declared there ("hlpuart1" for the NUCLEO-L4R5ZI).
  • The GPIOs used for the UART communication, identified in the GPIO Settings panel as described in the Add X-CUBE-AI software pack section.

By placing the function between the tags /* USER CODE BEGIN 0 */ and /* USER CODE END 0 */, the code will be kept even if you regenerate the project for instance to test another model.

Modify the file app_x-cube-ai.c (located in X-CUBE-AI/App). Replace the full function ai_mnetwork_run:

This snippet is provided AS IS, and by taking it, you agree to be bound to the license terms that can be found here for the component: Application.
AI_API_ENTRY
ai_i32 ai_mnetwork_run(ai_handle network, const ai_buffer* input,
        ai_buffer* output) {}
by:
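/* Functions added to main.c (USER CODE 0 section) */
extern void MX_UARTx_DeInit(void);
extern void MX_GPIO_Disable(void);
/* Number of inferences run by the standard performance test before
   switching to the power measurement loop */
#define AI_MIN_LOOP 16

AI_API_ENTRY
ai_i32 ai_mnetwork_run(ai_handle network, const ai_buffer* input, ai_buffer* output)
{
  struct network_instance* inn;
  static ai_i32 Counter = 0;

  /* Retrieve the network instance, as done in the generated function
     (ai_mnetwork_handle is the helper defined earlier in app_x-cube-ai.c) */
  inn = ai_mnetwork_handle((struct network_instance *)network);
  if (inn == NULL)
    return 0;

  if (Counter < AI_MIN_LOOP)
  {
    /* Normal path: inference used for the inference time report */
    Counter++;
    return inn->entry->ai_run(inn->handle, input, output);
  }
  else
  {
    printf("\nStarting infinite power measurement loop\n");
    /* Switch off the VCOM UART and the GPIO clocks so that only the
       inference processing is measured */
    MX_UARTx_DeInit();
    MX_GPIO_Disable();
    while(1)
    {
      inn->entry->ai_run(inn->handle, input, output);
    }
  }
}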

Build and load the firmware in the STM32 as usual using STM32Cube IDE or STM32Cube Programmer.

The system performance application first runs a full loop of 16 inferences to report the average inference time as usual. It then disables the UART and the GPIO clocks and enters an infinite loop running the inference, which allows the power consumption to be measured. Note that the modified ai_mnetwork_run function must be copied again each time the project is regenerated through STM32Cube.AI.

4. Power Measurement setup

For Nucleo boards with an STM32 whose power consumption is below 50 mA, the recommended way to measure power consumption is to use the ST power shield X-NUCLEO-LPM01A: https://www.st.com/en/evaluation-tools/x-nucleo-lpm01a.html.

Characteristics:

  • Programmable voltage source from 1.8 V to 3.3 V
  • Static current measurement from 1 nA to 200 mA
  • Dynamic measurements:
    • Current from 100 nA to 50 mA
    • Power measurement from 180 nW to 165 mW
    • Energy measurement computation by power measurement time integration
    • Execution of EEMBC ULPMark™ tests
  • Standalone mode:
    • Monochrome LCD, 2 lines of 16 characters with backlight
    • 4-direction joystick with selection button
    • Enter and Reset push-buttons
  • Controlled mode:
    • Connection to a PC through USB FS micro-B receptacle
    • Command line (virtual COM port) or STM32CubeMonitor-Power PC tool

4.1. Board HW setup


  • Connect the power shield to the board: remove the IDD jumper, connect the OUT pin of the white connector of the power shield (3rd pin from the right) to the pin of the IDD connector that is connected to the STM32 (left pin on the Nucleo), and connect the GND pin of the power shield to a GND pin of the Nucleo.
  • Power up the power shield through the USB.
  • Connect the Nucleo board as usual to the PC through the USB cable.
  • Using the joystick, select the appropriate voltage. Measurements shall be done at 3 V (default) or 3.3 V to ensure UART communication with the STLink and then with the PC through the Virtual Com Port.
  • For ultra-low-power STM32 devices like the STM32L4R5ZI, the current is below 50 mA and dynamic mode (the default mode) can be used.
  • For high-performance STM32 devices like the STM32H7, the current is between 50 mA and 200 mA and static mode shall be selected (use the joystick to select it).
  • Press the Enter button to power the STM32.
  • As the STLink is powered separately, the STM32 firmware can be loaded as usual and VCom can be opened.
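The current limits quoted in the list above can be summarized as a small decision helper. This is only an illustrative sketch: the enum and the function name are hypothetical and are not part of any ST tool, and the 50 mA and 200 mA limits are simply the ones given above for the X-NUCLEO-LPM01A.

```c
/* Hypothetical names, for illustration only. */
typedef enum {
    LPM01A_DYNAMIC,    /* power shield, dynamic mode: up to 50 mA */
    LPM01A_STATIC,     /* power shield, static mode: up to 200 mA */
    EXTERNAL_AMMETER   /* beyond the range of the power shield */
} measure_setup_t;

/* Pick a measurement setup from the expected STM32 current in mA. */
measure_setup_t select_measure_setup(double expected_current_ma)
{
    if (expected_current_ma <= 50.0) {
        return LPM01A_DYNAMIC;
    }
    if (expected_current_ma <= 200.0) {
        return LPM01A_STATIC;
    }
    return EXTERNAL_AMMETER;
}
```

For example, an expected current of 20 mA selects the dynamic mode, while 120 mA selects the static mode.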

Controlled mode: for an STM32 which does not require more than 50 mA, the power shield can be used in controlled mode:

  • For controlled mode, download and install STM32CubeMonitor-Power.
  • Follow the user manual instructions to connect and control the power shield.
  • STM32CubeMonitor-Power can display and save the measured current over time, and compute its average value over a given window.


As an alternative to the X-NUCLEO-LPM01A power shield, an ammeter can be used: remove the IDD jumper and connect the ammeter in its place:

4.2. Inference time report

Once the STM32 board is connected to the PC through USB (USB PWR connector, not the User USB), a hyper terminal can be opened to get messages from the board. The board is identified by a COM port. Open a hyper terminal like TeraTerm and configure the serial port as follows:

  • Port: the STM32 com port number
  • Speed: 115200
  • Data: 8 bits
  • Parity: none
  • Stop bits: 1 bit
  • Flow control: none
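On a POSIX host (Linux or macOS), the same serial settings can be expressed with the standard termios API instead of a terminal emulator. The sketch below is an assumption that goes beyond the TeraTerm setup described above: the function name vcp_fill_termios is hypothetical, and it only fills the settings structure without opening any device.

```c
#include <string.h>
#include <termios.h>

/* Fill a termios structure with the settings listed above:
 * 115200 baud, 8 data bits, no parity, 1 stop bit, no flow control.
 * Returns 0 on success, -1 on error. (Hypothetical helper name.) */
int vcp_fill_termios(struct termios *tio)
{
    memset(tio, 0, sizeof(*tio));

    /* 8 data bits, receiver enabled, modem control lines ignored.
     * PARENB and CSTOPB stay cleared: no parity, 1 stop bit.
     * No hardware flow control flag is set. */
    tio->c_cflag = CS8 | CREAD | CLOCAL;

    /* c_iflag stays 0: IXON/IXOFF off, so no software flow control.
     * c_lflag stays 0: raw (non-canonical) input, no echo. */

    /* Blocking read that returns as soon as one byte is available. */
    tio->c_cc[VMIN] = 1;
    tio->c_cc[VTIME] = 0;

    if (cfsetispeed(tio, B115200) != 0) {
        return -1;
    }
    if (cfsetospeed(tio, B115200) != 0) {
        return -1;
    }
    return 0;
}
```

The filled structure would then be applied with tcsetattr() to a file descriptor obtained by opening the Virtual Com Port device; the device name depends on the host and is not specified here.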

From board reset, the board reports the average inference time results:

#
# AI system performance measurement 5.2
#
Compiled with GCC 9.3.1
STM32 Runtime configuration...
 Device       : DevID:0x0470 (STM32L4Rxxx) RevID:0x1003
 Core Arch.   : M4 - FPU PRESENT and used
 HAL version  : 0x010d0000
 system clock : 120 MHz
 FLASH conf.  : ACR=0x00000605 - Prefetch=False $I/$D=(True,True) latency=5
 Calibration  : HAL_Delay(1)=1.002 ms

AI platform (API 1.1.0 - RUNTIME 7.0.0)
Discovering the network(s)...

Found network "network"
Creating the network "network"..
Initializing the network
Network informations...
 model name         : network
 model signature    : 7c2d7e42ad7cf646bfa1597f40fedd81
 model datetime     : Mon Aug  2 13:14:24 2021
 compile datetime   : Aug  2 2021 13:22:51
 runtime version    : 7.0.0
 tools version      : 7.0.0
 complexity         : 13597156 MACC
 c-nodes            : 31
 activations        : 67136 bytes (0x20003660)
 weights            : 478804 bytes (0x080105e0)
 inputs/outputs     : 1/1
  I[0]  u8, scale=0.007812, zero=128, 49152 bytes, shape=(128,128,3) (@0x20003dac)
  O[0]  u8, scale=0.003906, zero=0, 1001 bytes, shape=(1,1,1001) (@0x20003660)

Running PerfTest on "network" with random inputs (16 iterations)...
................

Results for "network", 16 inferences @120MHz/120MHz (complexity: 13597156 MACC)
 duration     : 427.337 ms (average)
 CPU cycles   : 51280470 (average)
 CPU Workload : 42% (duty cycle = 1s)
 cycles/MACC  : 3.77 (average for all layers)
 used stack   : 840 bytes
 used heap    : 0:0 0:0 (req:allocated,req:released) max=0 cur=0 (cfg=0)
 observer res : 520 bytes used from the heap (31 c-nodes)

 Inference time by c-node
  kernel  : 427.149ms (time passed in the c-kernel fcts)
  user    : 0.320ms (time passed in the user cb)

 c_id  type                id       time (ms)
 ---------------------------------------------------
 0     CONV2D              0         62.275  14.58 %
 1     CONV2D              1         41.909   9.81 %
 …
 30    NL                  30         0.133   0.03 %
 -------------------------------------------------
                                    427.149 ms

Running PerfTest on "network" with random inputs (16 iterations)...

Starting infinite power measurement loop

From the “Starting infinite power measurement loop” message onward, the code loops on the inference processing, and you can measure the power consumption and compute the energy per inference.
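The figures in the report above are related by simple arithmetic, which can help when cross-checking a log. The sketch below uses hypothetical helper names and only reproduces two of the reported values from the CPU cycle count, the system clock and the model complexity.

```c
#include <stdint.h>

/* Average inference duration in milliseconds, from the average CPU
 * cycle count and the system clock frequency in Hz. */
double cycles_to_ms(uint64_t cpu_cycles, double clock_hz)
{
    return (double)cpu_cycles / clock_hz * 1000.0;
}

/* Average number of CPU cycles spent per MACC operation. */
double cycles_per_macc(uint64_t cpu_cycles, uint64_t macc)
{
    return (double)cpu_cycles / (double)macc;
}
```

With the values of the log above (51280470 cycles, 120 MHz, 13597156 MACC), cycles_to_ms gives about 427.337 ms and cycles_per_macc gives about 3.77, which matches the reported duration and cycles/MACC.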

4.3. Measure process

4.3.1. Measure process when current is below 50 mA

One measurement method is to use the X-NUCLEO-LPM01A power shield in controlled mode with STM32CubeMonitor-Power. From STM32CubeMonitor-Power, take control of the power shield, then set a high sampling frequency (typically 10000 Hz), a first acquisition period (for instance 10 s) and an input voltage of 3.3 V or 3 V. Open a hyper terminal to get the log from the board and start the acquisition in STM32CubeMonitor-Power. Wait until you get the “Starting infinite power measurement loop” message in the board log.

From the board log, retrieve the average inference time computed over 16 inferences of the neural network. Start another acquisition over a fixed period covering several inferences, typically 10 s, and click on “Show Report” to get the average current over these 10 s.

The UART report gives the average inference time t in seconds, and STM32CubeMonitor-Power gives the average current i in amperes for a given input voltage u in volts. The average energy per inference is then t × i × u, in joules.
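The computation above can be sketched in a few lines of C. The function names are hypothetical, and the averaging helper is only a stand-in for the average that STM32CubeMonitor-Power already reports; it is useful mainly if raw current samples are exported and processed separately.

```c
#include <stddef.h>

/* Mean of a set of current samples, in amperes.
 * Returns 0.0 for an empty set. */
double mean_current_a(const double *samples_a, size_t count)
{
    double sum = 0.0;
    for (size_t k = 0; k < count; k++) {
        sum += samples_a[k];
    }
    return (count > 0) ? sum / (double)count : 0.0;
}

/* Average energy per inference in joules: E = t * i * u, with the
 * inference time t in seconds, the average current i in amperes and
 * the supply voltage u in volts. */
double energy_per_inference_j(double t_s, double i_a, double u_v)
{
    return t_s * i_a * u_v;
}
```

As an illustration, with the 427.337 ms inference time of the log above, a supply voltage of 3.3 V and an assumed average current of 10 mA (a made-up value, not a measurement), the energy per inference is 0.427337 × 0.010 × 3.3 ≈ 0.0141 J, that is about 14.1 mJ.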

4.3.2. Measure process when current is above 50 mA

To get accurate measurements, it is recommended to use an ammeter capable of providing the average current consumption over a given period. For a first estimation, you can use the X-NUCLEO-LPM01A power shield in “Static” mode. Make sure to wait for the “Starting infinite power measurement loop” message in the board log; the power shield display then provides static current measurements, from which you can estimate the average current consumption.

5. Power measure at 1.8 V

The X-NUCLEO-LPM01A power shield can power the board at different voltages, typically 1.8 V, 3 V or 3.3 V. When connected as described in the Board HW setup section, it supplies the STM32 directly.

To use 1.8 V, make sure to modify the RCC / System Parameters settings in the System Core category accordingly before generating the project. Some parameters, like the Flash latency, are then automatically adapted if needed.

The STLink embedded in ST evaluation boards is powered at 3.3 V. The STLink is used to program the STM32 and also to provide the Virtual Com Port capability: the STM32 is connected through a UART to the STLink, which is itself connected to the PC through USB, allowing communication between the STM32 and the PC on a given COM port. If the STM32 is powered at 1.8 V, both the communication between the STLink and the STM32 and the STM32 programming may become unreliable. To avoid programming issues, load the firmware without the power shield connected, that is, in the default board conditions at 3.3 V. The power shield can then be connected. It might happen, for instance, that the STM32 log is not correctly captured in the hyper terminal. In that case, you can measure the inference time with the STM32CubeMonitor-Power tool (if the current is below 50 mA). Acquire the current as in the 3.3 V case, but make sure to configure the input voltage to 1800 mV. In the established power measurement regime, the code loops infinitely on the inference, so it is easy to identify the power pattern on the graph and derive the inference time from it. You can average over several windows if needed. The inference time should be of the same order of magnitude as at 3.3 V.
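If the current samples are exported, reading the inference time from the repeating pattern can also be automated. The sketch below is one possible approach, not the method used by STM32CubeMonitor-Power: it assumes that the current rises through a chosen threshold exactly once per inference, and the function name and parameters are hypothetical. A noisy trace would need filtering or hysteresis first.

```c
#include <stddef.h>

/* Estimate the period, in seconds, of a repeating current pattern by
 * averaging the spacing between rising crossings of a threshold.
 * Assumes exactly one rising crossing per inference.
 * Returns 0.0 if fewer than two crossings are found. */
double estimate_period_s(const double *samples_a, size_t count,
                         double sample_rate_hz, double threshold_a)
{
    size_t first = 0;
    size_t last = 0;
    size_t crossings = 0;

    for (size_t k = 1; k < count; k++) {
        /* Rising crossing: previous sample below, current at or above. */
        if (samples_a[k - 1] < threshold_a && samples_a[k] >= threshold_a) {
            if (crossings == 0) {
                first = k;
            }
            last = k;
            crossings++;
        }
    }

    if (crossings < 2 || sample_rate_hz <= 0.0) {
        return 0.0;
    }
    /* (crossings - 1) full periods lie between first and last. */
    return (double)(last - first) / (double)(crossings - 1) / sample_rate_hz;
}
```

The threshold has to be chosen by looking at the acquired graph, so that it separates a distinctive high-current phase of the inference from the rest of the pattern.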

For currents above 50 mA, the same methodology can be applied with a more advanced ammeter. With the X-NUCLEO-LPM01A power shield, you can take the inference time measured at 3.3 V as an estimate, and then get the current measurements in Static mode with the voltage set to 1.8 V (otherwise the same methodology as at 3.3 V).

Alternatively, to get the UART working at the right voltage, you can use a free UART instead of going through the STLink (you will have to change the default UART settings defined automatically by STM32Cube.AI). You can then use a USB TTL serial cable operating at 1.8 V, as sold by FTDI Chip (https://ftdichip.com/products/ttl-232rg-vreg1v8-we), or use a level shifter board, like the 4-channel BSS138 devices sold by Adafruit, to connect to an STLink (from a Nucleo board or an external STLink). More information on alternative energy measurement setups can be found on the EEMBC GitHub: https://github.com/eembc/energyrunner#hardware-setup.

6. Important notes

6.1. Specific case of STM32H747 Discovery board

The STM32H747 has an internal SMPS, but this power supply option can only be used up to 400 MHz and with a specific HW configuration. At 480 MHz, the LDO option shall be used. The STM32H747 Discovery board is configured by default to run on the SMPS and is therefore limited to 400 MHz. The board can be run at 480 MHz, but it must be reconfigured with specific jumpers; be careful to also use the appropriate SW configuration in that case. Power efficiency is optimal in SMPS mode. When configuring the STM32H747 Discovery board in CubeMX, make sure that the HCLK frequency is set to 400 MHz and that the Power Parameters SupplySource is set to PWR_DIRECT_SMPS_SUPPLY in the System Core / RCC panel.

6.2. External SMPS

Some boards, like the NUCLEO-L4R5ZI-P, embed an external SMPS, but a specific initialization sequence is required to enable it. X-CUBE-AI does not enable this specific mode, so be aware that the SMPS is not enabled when following the tutorial above.

6.3. Power measures

Note that power and energy measurements vary depending on the temperature, the STM32 sample itself, and the measurement equipment and its settings. In a final application, the power consumption varies depending on the code, the memory placement and even the input data. Note that with the tutorial above the input data are random, but once the established current measurement regime is reached they no longer change and stay identical during the full sequence. The measurements vary from one trial to another, but should remain of the same order of magnitude.