Deep Quantized Neural Network support

Revision as of 12:32, 2 July 2022 by Registered User

1. Overview

The stm32ai application can be used to deploy a pre-trained Deep Quantized Neural Network (DQNN) model designed and trained with the QKeras or Larq library. The purpose of this article is to highlight the supported configurations and limitations, so that an efficient and optimized C-inference model can be deployed on STM32 targets. For detailed explanations and recommendations on designing a DQNN model, check out the respective user guides or the provided notebook(s).

2. What is a Deep Quantized Neural Network?

A quantized model generally refers to a model that uses an 8-bit signed/unsigned integer data format to encode each weight and activation. After an optimization/quantization process (Post-Training Quantization, PTQ, or Quantization-Aware Training, QAT), a floating-point network can be deployed with smaller-integer arithmetic that is more efficient in terms of computational resources. DQNN denotes models that use a bit width of less than 8 bits to encode some weights and/or activations. Mixed data types (hybrid layers) can also be considered for a given operator (for example, binary weights with 8-bit signed integer or 32-bit floating-point activations), allowing a trade-off between accuracy/precision and peak memory usage. To preserve model performance, the QKeras and Larq libraries only support training in a quantization-aware manner (QAT).

Each library is designed as an extension of the high-level Keras API (custom layers) that provides an easy way to quickly create a deep quantized version of an original Keras network. As shown in the left part of the following figure, based on the concept of quantized layers and quantizers, the user can transform a full-precision layer by describing how to quantize the incoming activations and weights. Note that the quantized layers are fully compatible with the Keras API, so they can be used interchangeably with Keras layers. This property allows you to design mixed models in which some layers are kept in float.

Note that, compared with a classical quantized model (8-bit format only), a DQNN model can be considered an advanced optimization approach/alternative to deploy a model in resource-constrained environments such as an STM32 device without a significant loss of accuracy. "Advanced", because designing this type of model is, by construction, not straightforward.
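As an illustration of the data formats discussed above, the two quantization schemes can be emulated in plain Python. This is a minimal sketch, not the stm32ai or QKeras/Larq implementation; the function names and the fixed scale are illustrative assumptions:

```python
# Illustrative sketch only: emulates an 8-bit symmetric quantizer and a
# binary signed quantizer; real QKeras/Larq quantizers are more elaborate.

def quantize_s8(x, scale):
    """Map a float to an 8-bit signed integer code with a symmetric scale."""
    q = round(x / scale)
    return max(-128, min(127, q))      # saturate to the int8 range

def dequantize_s8(q, scale):
    """Recover an approximate float from the 8-bit code."""
    return q * scale

def quantize_s1(x):
    """Binary signed quantizer: +1 for non-negative values, -1 otherwise."""
    return 1 if x >= 0 else -1

w = 0.42
scale = 1 / 128                        # covers roughly [-1.0, 1.0)
q = quantize_s8(w, scale)
print(q, dequantize_s8(q, scale))      # 54 0.421875 (small rounding error)
print(quantize_s1(w), quantize_s1(-w)) # 1 -1
```

The 8-bit code keeps the value within a small rounding error, while the binary code keeps only the sign; this is the accuracy/memory trade-off mentioned above.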

3. 1-bit and 8-bit signed format support

The Arm Cortex-M instruction set and the required data manipulations (pack/unpack operations during memory transfers, and so on) do not allow an efficient implementation for all combinations of data types. The stm32ai application focuses primarily on implementations that improve peak memory usage (flash and/or RAM) and reduce latency (execution time), which means that no "optimize for size only" option is supported. Therefore, only the 32-bit float, 8-bit signed, and binary signed (1-bit) data types are considered by the code generator to deploy the optimized C-kernels (see the "Optimized C-kernel configurations" section below). Otherwise, when possible, a fallback to the 32-bit floating-point C-kernel is used, with pre/post quantize/dequantize operations.

Figure 1: From quantized layer to deployed operator:

  • The data type of the input tensors is defined by the 'input_quantizer' argument (Larq) or inferred from the data type of the incoming operators (Larq/QKeras).
  • The data type of the output tensors is inferred from the outgoing operator chain.
  • In the case where the input/output and weight tensors are quantized with a classical 8-bit integer scheme (as for the TFLite quantized models), the respective optimized int8 C-kernel implementations are used. The same applies to full floating-point operators.
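To illustrate why the binary signed format enables efficient kernels, here is a sketch of the general XNOR/popcount technique (not the actual stm32ai C-kernels): when both activations and weights are in a binary format, a dot product reduces to a bitwise XOR followed by a population count. The helper names below are illustrative assumptions:

```python
# Sketch of the classic XNOR/popcount trick used by binarized kernels.
# Values are in {-1, +1}; a -1 is encoded as bit '1' and a +1 as bit '0'.

def encode(values):
    """Pack a list of +/-1 values into an integer bit mask (bit=1 for -1)."""
    mask = 0
    for i, v in enumerate(values):
        if v < 0:
            mask |= 1 << i
    return mask

def binary_dot(a, b, n):
    """Dot product of two +/-1 vectors of length n from their bit masks."""
    diff = bin(a ^ b).count("1")   # positions where signs differ -> product -1
    return n - 2 * diff            # (n - diff) * (+1) + diff * (-1)

x = [1, -1, -1, 1]
w = [1, 1, -1, -1]
ref = sum(xi * wi for xi, wi in zip(x, w))
print(binary_dot(encode(x), encode(w), len(x)), ref)  # prints: 0 0
```

A single 32-bit XOR plus a popcount thus replaces 32 multiply-accumulate operations, which is the source of the "sxor_s1_s1" operation counts reported later in this article.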

4. QKeras library

QKeras is a quantization extension framework developed by Google. It provides drop-in replacements for some of the Keras layers, especially the ones that create parameters and activation layers and perform arithmetic operations, so that a deep quantized version of a Keras network can be created quickly. QKeras is designed to extend the functionality of Keras following Keras' design principles, that is, being user friendly, modular, and extensible, while remaining "minimally intrusive" with respect to native Keras functionality. It also provides the QTools and AutoQKeras tools to help the user deploy a quantized model on a specific hardware implementation, or to treat quantization as a hyperparameter search in a Keras-tuner environment.

This snippet is provided AS IS, and by taking it, you agree to be bound to the license terms that can be found here for the component: Linker Scripts.
import tensorflow as tf
import qkeras
...
x = tf.keras.Input(shape=(28, 28, 1))
y = qkeras.QActivation(qkeras.quantized_relu(bits=8, alpha=1))(x)
y = qkeras.QConv2D(16, (3, 3),
        kernel_quantizer=qkeras.binary(alpha=1),
        bias_quantizer=qkeras.binary(alpha=1),
        use_bias=False,
        name="conv2d_0")(y)
y = tf.keras.layers.MaxPooling2D(pool_size=(2,2))(y)
y = tf.keras.layers.BatchNormalization()(y)
y = qkeras.QActivation(qkeras.binary(alpha=1))(y)
...
model = tf.keras.Model(inputs=x, outputs=y)

4.1. Supported QKeras quantizers/layers

  • QActivation
  • QBatchNormalization
  • QConv2D
  • QConv2DTranspose
    • 'padding' parameter should be 'valid'
    • 'stride' must be '(1, 1)'
  • QDense
    • Only a 2D input shape is supported: [batch_size, input_dim]. A rank greater than 2 is not supported; a Flatten layer should be added before the QDense operator
  • QDepthwiseConv2D

The following quantizers and associated configurations are supported:

 Quantizer           | Comments / limitations
 --------------------|------------------------------------------------------------------------
 quantized_bits()    | only the 8-bit size (bits=8) is supported
 quantized_relu()    | only the 8-bit size (bits=8) is supported
 quantized_tanh()    | only the 8-bit size (bits=8) is supported
 binary()            | only supported in signed version (use_01=False), without scale (alpha=1)
 stochastic_binary() | only supported in signed version (use_01=False), without scale (alpha=1)

Figure 2: QKeras quantizers:

  • Typically, the 'quantized_relu()' quantizer can be used to quantize inputs that are normalized between '0.0' and '1.0'. Note that 'quantized_relu(bits=8, integer=8)' can be considered if the range of the input values is between '0.0' and '256.0'.
x = tf.keras.Input(shape=(..))
y = qkeras.QActivation(qkeras.quantized_relu(bits=8, integer=0))(x)
y = qkeras.QConv2D(..)
...
  • The 'quantized_bits()' quantizer can be used to quantize inputs that are normalized between '-1.0' and '1.0'. Note that 'quantized_bits(bits=8, integer=7)' can be considered if the range of the input values is between '-128.0' and '127.0'.
x = tf.keras.Input(shape=(..))
y = qkeras.QActivation(qkeras.quantized_bits(bits=8, integer=0, symmetric=0, alpha=1))(x)
y = qkeras.QConv2D(..)
...
  • To have a fully binarized operation without bias and a normalized and binarized output:
...
y = qkeras.QActivation(qkeras.binary(alpha=1))(y)
y = qkeras.QConv2D(..
    kernel_quantizer="binary(alpha=1)",
    bias_quantizer=qkeras.binary(alpha=1),
    use_bias=False,
    )(y)
y = tf.keras.layers.MaxPooling2D(...)(y)
y = tf.keras.layers.BatchNormalization()(y)
y = qkeras.QActivation(qkeras.binary(alpha=1))(y)
...

5. Larq library

Larq is an open-source Python library for training neural networks with extremely low-precision weights and activations, such as Binarized Neural Networks (BNNs). The approach is similar to the QKeras library, with a primary focus on BNN models. To deploy the trained model, a specific, highly optimized inference engine (Larq Compute Engine, LCE) is also provided for various mobile platforms.
...
import tensorflow as tf
import larq as lq
...
x = tf.keras.Input(shape=(28, 28, 1))
y = tf.keras.layers.Flatten()(x)
y = lq.layers.QuantDense(
        512,
        kernel_quantizer="ste_sign",
        kernel_constraint="weight_clip")(y)
y = lq.layers.QuantDense(
        10,
        input_quantizer="ste_sign",
        kernel_quantizer="ste_sign",
        kernel_constraint="weight_clip")(y)
y = tf.keras.layers.Activation("softmax")(y)
...
model = tf.keras.Model(inputs=x, outputs=y)

5.1. Supported Larq layers

  • QuantConv2D
    • for binary quantization, 'pad_values=-1 or 1' is required if 'padding="same"'
    • with the 'DoReFa(..)' quantizer, 'use_bias=False' is expected
  • QuantDense
    • 'DoReFa(..)' quantizer is not supported
    • only a 2D input shape is supported: [batch_size, input_dim]. A rank greater than 2 is not supported; a Flatten layer should be added before the QuantDense operator
  • QuantDepthwiseConv2D
    • for binary quantization, 'pad_values=-1 or 1' is required if 'padding="same"'
    • 'DoReFa(..)' quantizer is not supported

Only the following quantizers and associated configurations are supported. The Larq quantizers are fully described in the Larq documentation: https://docs.larq.dev/larq/api/quantizers/

 Quantizer    | Comments / limitations
 -------------|---------------------------------------------------------------
 'SteSign'    | used for binary quantization
 'ApproxSign' | used for binary quantization
 'SwishSign'  | used for binary quantization
 'DoReFa'     | only the 8-bit size (k_bit=8) is supported, for the QuantConv2D layer
  • Typically, the 'DoReFa(k_bit=8, mode="activations")' quantizer can be used to quantize inputs that are normalized between '0.0' and '1.0'. Note that 'DoReFa(k_bit=8, mode="weights")' quantizes the weights between '-1.0' and '1.0'.
x = tf.keras.Input(shape=(..))
y = larq.layers.QuantConv2D(..,
        input_quantizer=larq.quantizers.DoReFa(k_bit=8, mode="activations"),
        kernel_quantizer=larq.quantizers.DoReFa(k_bit=8, mode="weights"),
        use_bias=False,
        )(x)
...

6. Optimized C-kernel configurations

6.1. Implementation conventions

This section lists the optimized data-type combinations, using the following naming convention:

  • f32 identifies the absence of quantization (that is, 32-bit floating point)
  • s8 refers to the 8-bit signed quantizers
  • s1 refers to the binary signed quantizers

6.2. C-layout of the s1 type

The elements of a binary activation tensor are packed into 32-bit words along the last dimension ('axis=-1') with the following rules:

  • bit order: MSB first within each 32-bit word
  • pad value: '0b'
  • a positive value is coded with '0b', while a negative value is coded with '1b'
Information
It is recommended to have the number of channels as a multiple of 32 to optimize the flash/RAM size and MAC/cycle, but it is not mandatory.
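The packing rules above can be sketched in plain Python (illustrative only, not the generated C code; the helper name is an assumption):

```python
# Sketch of the s1 packing rules: pack the last dimension into 32-bit
# words, MSB first, coding positive values as bit '0' and negative values
# as bit '1', and padding the final word with '0' bits.

def pack_s1(channel_values):
    """Pack a list of +/-1 activations into a list of 32-bit words."""
    words = []
    for base in range(0, len(channel_values), 32):
        word = 0
        chunk = channel_values[base:base + 32]
        for i, v in enumerate(chunk):
            if v < 0:                  # negative value -> bit '1'
                word |= 1 << (31 - i)  # MSB first
        words.append(word)             # remaining bits stay '0' (pad)
    return words

# 33 values: one full word, plus a second word carrying 31 pad bits
vals = [-1] + [1] * 31 + [-1]
print([hex(w) for w in pack_s1(vals)])  # ['0x80000000', '0x80000000']
```

The 33rd element forces a second, mostly padded word, which illustrates why a channel count that is a multiple of 32 uses the packed storage most efficiently.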

6.3. Quantized Dense layers

 input format | output format | weight format | bias format (1) | notes
 -------------|---------------|---------------|-----------------|-------------------------------
 s1           | s1            | s1            | s1              | (2)
 s1           | s1            | s1            | f32             | (2)
 s1           | s1            | s8            | s1              | (2)
 s1           | s8            | s1            | s8              | (2)
 s1           | s8            | s8            | s8              | (2)
 s1           | f32           | s1            | s1              | (2)
 s1           | f32           | s1            | f32             | (2)
 s1           | f32           | s8            | s8              | (2)
 s1           | f32           | f32           | f32             | (2)
 s8           | s1            | s1            | s1              | (2)
 s8           | s8            | s1            | s1              | (2)
 s8           | f32           | s1            | s1              | (2)
 s8           | s8            | s8            | s8              | (2), bias stored in s32 format
 s8           | s8            | s8            | s32             | (2), int8-tflite kernels
 f32          | s1            | s1            | s1              | (2)
 f32          | s1            | s1            | f32             | (2)
 f32          | f32           | s1            | s1              | (2)
 f32          | f32           | s1            | f32             | (2)

(1) usage of the bias is optional
(2) batch-normalization can be fused

6.4. Optimized Convolution layers

 input format | output format | weight format | bias format (1) | notes
 -------------|---------------|---------------|-----------------|------------------------------------------------------
 s1           | s1            | s1            | s1              | (2), (3), including pointwise and depthwise versions
 s1           | s8            | s1            | s8              | (2), including pointwise version
 s1           | s8            | s1            | f32             | (2), including pointwise version
 s1           | f32           | s1            | f32             | (2), including pointwise version
 s8           | s1            | s8            | -               |
 s8 (DoReFa)  | s1            | s8 (DoReFa)   | -               | (2), use_bias=False
 s8           | s8            | s8            | s8              | (2), bias stored in s32 format
 s8           | s8            | s8            | s32             | (2), int8-tflite kernels

(1) usage of the bias is optional
(2) batch-normalization can be fused
(3) a maxpool can be inserted between the convolution and the batch-normalization operators

6.5. Misc layers

The following layers are also available to support more complex topologies, for example those with residual connections.

 layer   | input format | output format | notes
 --------|--------------|---------------|----------------------------------------------------------------
 maxpool | s1           | s1            | the s8/f32 data types are also supported by the "standard" C-kernels
 concat  | s1           | s1            | the s8/f32 data types are also supported by the "standard" C-kernels

7. Evidence of efficient code generation

Similar to the 'qkeras.print_qstats()' function or the extended 'summary()' function in Larq, the analyze command reports a summary of the number of operations used by each generated C-layer according to the data types. The number of operations per type for the entire generated C-model is also reported. This last piece of information makes it possible to know whether the deployed model is entirely or partially based on the optimized binarized/quantized C-kernels.

Information
For the size of the deployed weights, the 'ROM'/'weights (ro)' metric indicates the expected size needed to store the quantized weights on the target. Note that the reported value is compared with the size needed to store the weights in the original format (32-bit floating point). Detailed information per C-layer and per associated tensor is available in the generated reports.

The following example shows that 90% of the operations are binary operations, with the "quant_conv2d_1_conv2d" layer as the main contributor.

$ stm32ai <model_file.h5>
...
 params #             : 93,556 items (365.45 KiB)
 macc                 : 2,865,718
 weights (ro)         : 14,496 B (14.16 KiB) (1 segment) / -359,728(-96.1%) vs float model
 activations (rw)     : 86,528 B (84.50 KiB) (1 segment)
 ram (total)          : 89,704 B (87.60 KiB) = 86,528 + 3,136 + 40
...
 Number of operations and param per c-layer
 -------------------------------------------------------------------------------------------
 c_id    m_id   name (type)                                  #op (type)
 -------------------------------------------------------------------------------------------
 0       2      quant_conv2d_conv2d (conv2d)                         194,720 (smul_f32_f32)
 1       3      quant_conv2d_1_conv (conv)                            43,264 (conv_f32_s1)
 2       1      max_pooling2d (pool)                                  21,632 (op_s1_s1)
 3       3      quant_conv2d_1_conv2d (conv2d_dqnn)                2,230,272 (sxor_s1_s1)
 4       5      max_pooling2d_1 (pool)                                 6,400 (op_s1_s1)
 5       7      quant_conv2d_2_conv2d (conv2d_dqnn)                  331,776 (sxor_s1_s1)
 6       10     quant_dense_quantdense (dense_dqnn_dqnn)              36,864 (sxor_s1_s1)
 7       13     quant_dense_1_quantdense (dense_dqnn_dqnn)               640 (sxor_s1_s1)
 8       15     activation (nl)                                          150 (op_f32_f32)
 -------------------------------------------------------------------------------------------
 total                                                             2,865,718

   Number of operation types
   ---------------------------------------------
   smul_f32_f32             194,720        6.8%
   conv_f32_s1               43,264        1.5%
   op_s1_s1                  28,032        1.0%
   sxor_s1_s1             2,599,552       90.7%
   op_f32_f32                   150        0.0%
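The 'weights (ro)' saving reported above can be cross-checked from the parameter count with a few lines of arithmetic (a sketch using the figures from this report only):

```python
# Cross-check of the 'weights (ro)' line in the report above: the saving
# versus the float model follows directly from the parameter count.
params = 93_556                            # 'params #' from the report
float_size = params * 4                    # 32-bit floats: 4 bytes each
quant_size = 14_496                        # 'weights (ro)' in bytes
delta = quant_size - float_size
print(delta)                               # -359728, as reported
print(round(100 * delta / float_size, 1))  # -96.1 (%), as reported
```

The 96.1% reduction reflects that most weights are stored in the 1-bit s1 format instead of 32-bit floats.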