Deep Quantized Neural Network support


1. Overview

The stm32ai application can be used to deploy a pretrained Deep Quantized Neural Network (DQNN) model designed and trained with the QKeras or Larq library. The purpose of this article is to highlight the supported configurations and limitations to be able to deploy an efficient and optimized c-inference model for STM32 targets. For detailed explanations and recommendations on designing a DQNN model, check out the respective user guides or the provided notebook(s).

2. What is a Deep Quantized Neural Network?

A quantized model generally refers to a model that uses an 8-bit signed/unsigned integer data format to encode each weight and activation. After an optimization/quantization process (Post-Training Quantization, PTQ, or Quantization-Aware Training, QAT), it allows a floating-point network to be deployed with arithmetic using smaller integers, which is more efficient in terms of computational resources. DQNN denotes models that use a bit width of less than 8 bits to encode some weights and/or activations. Mixed data types (hybrid layers) can also be considered for a given operator (for example, binary weights with 8-bit signed integer or 32-bit floating-point activations), allowing a trade-off between accuracy/precision and peak memory usage. To preserve accuracy, the QKeras and Larq libraries only train models in a quantization-aware manner (QAT).
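
As a simple illustration of these encodings (a minimal NumPy sketch, not the stm32ai implementation; values and shapes are arbitrary), an 8-bit signed quantization keeps a scaled integer approximation of each value, while a binary quantization keeps only its sign:

import numpy as np

w = np.random.uniform(-0.8, 0.8, size=(3, 3)).astype(np.float32)

# 8-bit signed (symmetric) quantization: scaled integer approximation
scale = np.max(np.abs(w)) / 127.0
w_int8 = np.clip(np.round(w / scale), -128, 127).astype(np.int8)
w_dequant = w_int8.astype(np.float32) * scale

# Binary (1-bit signed) quantization: only the sign is kept (+1 / -1)
w_bin = np.where(w >= 0.0, 1.0, -1.0).astype(np.float32)

print("max int8 error  :", np.max(np.abs(w - w_dequant)))
print("max binary error:", np.max(np.abs(w - w_bin)))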

Each library is designed as an extension of the high-level Keras API (custom layers) that provides an easy way to quickly create a deep quantized version of an original Keras network. As shown in the left part of the following figure, based on the concept of quantized layers and quantizers, the user can transform a full-precision layer by describing how to quantize the incoming activations and weights. Note that the quantized layers are fully compatible with the Keras API, so they can be used interchangeably with Keras layers. This property allows designing mixed models in which some layers are kept in float.
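
For example, a quantized layer is used in place of its full-precision Keras counterpart, with the quantizers describing how the weights are quantized (a minimal QKeras sketch; layer sizes and quantizer settings are arbitrary):

import tensorflow as tf
import qkeras

# Full-precision Keras layer...
dense_fp32 = tf.keras.layers.Dense(64, activation="relu")

# ...and a quantized drop-in replacement: same role in the graph,
# but kernel and bias are quantized to 8-bit values
dense_q = qkeras.QDense(
    64,
    kernel_quantizer=qkeras.quantized_bits(bits=8, alpha=1),
    bias_quantizer=qkeras.quantized_bits(bits=8, alpha=1))

# Both layers can be mixed freely in the same tf.keras model,
# which is how some layers are kept in float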

Note that, compared with a classical quantized model (8-bit format only), a DQNN model can be considered an advanced optimization approach/alternative to deploy a model in resource-constrained environments such as an STM32 device without a significant loss of accuracy. "Advanced", because by construction the design of this type of model is not straightforward.

3. 1-bit and 8-bit signed format support

The ARM Cortex-M instruction set and the required data manipulations (pack/unpack operations during memory transfers, and so on) do not allow an efficient implementation for all combinations of data types. The stm32ai application focuses primarily on implementations that improve peak memory usage (flash and/or RAM) and reduce latency (execution time), which means that no "optimize for size only" option is supported. Therefore, only the 32-bit float, 8-bit signed integer, and signed binary (1-bit) data types are considered by the code generator to deploy the optimized c-kernels (see the next "Optimized C-kernels" section). Otherwise, when possible, a fallback to the 32-bit floating-point c-kernel is used with pre/post quantize/dequantize operations.

Figure 1: From quantized layer to deployed operator

  • Data type of the input tensors is defined by the 'input_quantizer' argument (Larq) or inferred from the data type of the incoming operators (Larq/QKeras); see the Larq sketch after this list.
  • Data type of the output tensors is inferred from the outgoing operator chain.
  • In the case where the input/output and weight tensors are quantized with a classical 8-bit integer scheme (as for TFLite quantized models), the respective optimized int8 c-kernel implementations are used. The same applies to full floating-point operators.
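
For example, with Larq the data type of the incoming activations of a layer is made explicit through the 'input_quantizer' argument (a minimal sketch; layer parameters are arbitrary):

import tensorflow as tf
import larq

model = tf.keras.Sequential([
    tf.keras.Input(shape=(28, 28, 1)),
    # First layer: binary weights, incoming activations left in float/8-bit
    # (no 'input_quantizer')
    larq.layers.QuantConv2D(16, (3, 3),
                            kernel_quantizer="ste_sign",
                            kernel_constraint="weight_clip",
                            use_bias=False),
    tf.keras.layers.BatchNormalization(),
    # Fully binarized layer: incoming activations and weights are 1-bit signed
    larq.layers.QuantConv2D(32, (3, 3),
                            input_quantizer="ste_sign",
                            kernel_quantizer="ste_sign",
                            kernel_constraint="weight_clip",
                            use_bias=False),
])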

4. QKeras library

QKeras is a quantization extension framework developed by Google. It provides drop-in replacements for some of the Keras layers, especially those that create parameters and activation layers and perform arithmetic operations, so that a deep quantized version of a Keras network can be created quickly. QKeras is designed to extend the functionality of Keras following Keras' design principles, that is, being user friendly, modular, and extensible, while remaining "minimally intrusive" with respect to native Keras functionality. It also provides the QTools and AutoQKeras tools to assist the user in deploying a quantized model on a specific hardware implementation, or in treating quantization as a hyperparameter search in a Keras Tuner environment.

import tensorflow as tf
import qkeras
...
x = tf.keras.Input(shape=(28, 28, 1))
# 8-bit quantized activations at the model input
y = qkeras.QActivation(qkeras.quantized_relu(bits=8))(x)
# Convolution with binary (1-bit signed) weights, bias disabled
y = qkeras.QConv2D(16, (3, 3),
        kernel_quantizer=qkeras.binary(alpha=1),
        bias_quantizer=qkeras.binary(alpha=1),
        use_bias=False,
        name="conv2d_0")(y)
y = tf.keras.layers.MaxPooling2D(pool_size=(2, 2))(y)
y = tf.keras.layers.BatchNormalization()(y)
# Binary (1-bit signed) activations
y = qkeras.QActivation(qkeras.binary(alpha=1))(y)
...
model = tf.keras.Model(inputs=x, outputs=y)
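
One way to check how the layers were actually quantized before passing the model to the stm32ai code generator is the QKeras print_qstats() utility (a usage sketch for the model above; the exact report depends on the QKeras version):

model.summary()

# Reports the per-layer operation counts and data types, which helps
# verify that weights/activations are quantized as intended
qkeras.print_qstats(model)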

4.1. Supported QKeras quantizers/layers

  • QActivation
  • QBatchNormalization
  • QConv2D
  • QConv2DTranspose
    • 'padding' parameter should be 'valid'
    • 'strides' must be '(1, 1)'
  • QDense
    • Only a 2D input shape is supported: [batch_size, input_dim]. A rank greater than 2 is not supported; a Flatten layer should be added before the QuantDense/QDense operator
  • QDepthwiseConv2D

The following quantizers and associated configurations are supported:

Quantizer             Comments / limitations
quantized_bits()      Only 8-bit size (bits=8) is supported
quantized_relu()      Only 8-bit size (bits=8) is supported
quantized_tanh()      Only 8-bit size (bits=8) is supported
binary()              Only supported in signed version (use_01=False), without scale (alpha=1)
stochastic_binary()   Only supported in signed version (use_01=False), without scale (alpha=1)
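
As a quick recap of these constraints, the following sketch uses only configurations that map onto the optimized kernels: 8-bit quantizers with bits=8, and binary quantizers in signed mode without scale (alpha=1). Names and layer sizes are illustrative only:

import tensorflow as tf
import qkeras

inputs = tf.keras.Input(shape=(32,))                          # rank-2 input for QDense
x = qkeras.QActivation(qkeras.quantized_relu(bits=8))(inputs)
x = qkeras.QDense(64,
                  kernel_quantizer=qkeras.binary(alpha=1),    # signed 1-bit, no scale
                  use_bias=False)(x)
x = qkeras.QActivation(qkeras.binary(alpha=1))(x)             # binary activations
outputs = qkeras.QDense(10,
                        kernel_quantizer=qkeras.quantized_bits(bits=8, alpha=1),
                        bias_quantizer=qkeras.quantized_bits(bits=8, alpha=1))(x)
model = tf.keras.Model(inputs, outputs)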