
LiteRT 8-bit quantization specification

This document outlines the specification for LiteRT's 8-bit quantization scheme. It is intended to assist hardware developers in providing hardware support for inference with quantized LiteRT models.

Specification summary

We provide a specification, and we can only offer guarantees on behaviour if the specification is followed. We also understand that different hardware has preferences and restrictions that may cause slight deviations when implementing the spec, resulting in implementations that are not bit-exact. While that may be acceptable in most cases (and we will provide a suite of tests that, to the best of our knowledge, include per-operation tolerances gathered from several models), the nature of machine learning (and deep learning in the most common case) makes it impossible to provide hard guarantees.

8-bit quantization approximates floating point values using the following formula.

\[real\_value = (int8\_value - zero\_point) \times scale\]
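As an illustration, the mapping and its inverse can be sketched in a few lines of NumPy. The function names, the round-to-nearest choice, and the clamping below are ours for illustration; actual kernels may differ in rounding behaviour:

    import numpy as np

    def dequantize(int8_value, scale, zero_point):
        # real_value = (int8_value - zero_point) * scale
        return (int8_value.astype(np.int32) - zero_point) * scale

    def quantize(real_value, scale, zero_point):
        # Invert the affine map, round, and clamp to the int8 range.
        q = np.round(real_value / scale) + zero_point
        return np.clip(q, -128, 127).astype(np.int8)

    x = np.array([-1.0, 0.0, 0.5], dtype=np.float32)
    q = quantize(x, scale=0.5, zero_point=-10)   # [-12, -10, -9]
    assert np.allclose(dequantize(q, 0.5, -10), x)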

Per-axis (aka per-channel in Conv ops) or per-tensor weights are represented by int8 two’s complement values in the range [-127, 127] with zero-point equal to 0. Per-tensor activations/inputs are represented by int8 two’s complement values in the range [-128, 127], with a zero-point in range [-128, 127].

There are exceptions for particular operations, documented below.

Signed integer vs unsigned integer

For 8-bit, LiteRT quantization primarily prioritizes tooling and kernels for signed int8. This is for the convenience of symmetric quantization being representable by a zero-point equal to 0. Additionally, many backends have extra optimizations for int8xint8 accumulation.

Per-axis vs per-tensor

Per-tensor quantization means there is one scale and/or zero-point for the entire tensor. Per-axis quantization means there is one scale and/or zero_point per slice along the quantized_dimension. The quantized dimension specifies the dimension of the tensor's shape that the scales and zero-points correspond to. For example, a tensor t with dims=[4, 3, 2, 1] and quantization params scale=[1.0, 2.0, 3.0], zero_point=[1, 2, 3], quantization_dimension=1 will be quantized across the second dimension of t:

t[:, 0, :, :] will have scale[0]=1.0, zero_point[0]=1
t[:, 1, :, :] will have scale[1]=2.0, zero_point[1]=2
t[:, 2, :, :] will have scale[2]=3.0, zero_point[2]=3
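
To make the broadcasting concrete, here is a small NumPy sketch of per-axis dequantization for exactly this example; the helper name is ours:

    import numpy as np

    def dequantize_per_axis(q, scales, zero_points, quantized_dimension):
        # Reshape the per-axis parameters so they broadcast along
        # quantized_dimension, then apply real = (q - zero_point) * scale.
        shape = [1] * q.ndim
        shape[quantized_dimension] = -1
        scales = np.asarray(scales, dtype=np.float32).reshape(shape)
        zero_points = np.asarray(zero_points, dtype=np.int32).reshape(shape)
        return (q.astype(np.int32) - zero_points) * scales

    t = np.ones((4, 3, 2, 1), dtype=np.int8)  # dims = [4, 3, 2, 1]
    r = dequantize_per_axis(t, [1.0, 2.0, 3.0], [1, 2, 3], quantized_dimension=1)
    # r[:, 1, :, :] == (t[:, 1, :, :] - 2) * 2.0, and likewise for each slice.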

Often, the quantized_dimension is the output_channel of the weights of convolutions, but in theory it can be the dimension that corresponds to each dot-product in the kernel implementation, allowing finer quantization granularity without performance implications. This can yield large improvements in accuracy.

TFLite has per-axis support for a growing number of operations. At the time of writing, support exists for Conv2d and DepthwiseConv2d.

Symmetric vs asymmetric

Activations are asymmetric: they can have their zero-point anywhere within the signed int8 range [-128, 127]. Many activations are asymmetric in nature, and a zero-point is a relatively inexpensive way to effectively get up to an extra binary bit of precision. Since activations are only multiplied by constant weights, the constant zero-point value can be optimized heavily.

Weights are symmetric: forced to have a zero-point equal to 0. Weight values are multiplied by dynamic input and activation values, which means there is an unavoidable runtime cost of multiplying the zero-point of the weight with the activation value. By enforcing that the zero-point is 0, we avoid this cost.
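
As a sketch, symmetric per-axis weight parameters are often derived from the per-slice maximum absolute value; this min/max recipe and the helper name are illustrative, not the mandated calibration method:

    import numpy as np

    def quantize_weights_symmetric(w, axis=0):
        # One scale per slice along `axis`; zero_point is fixed at 0.
        reduce_axes = tuple(i for i in range(w.ndim) if i != axis)
        max_abs = np.max(np.abs(w), axis=reduce_axes, keepdims=True)
        scales = np.maximum(max_abs, np.finfo(np.float32).tiny) / 127.0
        q = np.round(w / scales)
        # Note the restricted weight range [-127, 127]: -128 is never used.
        return np.clip(q, -127, 127).astype(np.int8), np.squeeze(scales)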

Explanation of the math: this is similar to section 2.3 in arXiv:1712.05877, except that we allow the scale values to be per-axis. This generalizes readily, as follows:

$A$ is an $m \times n$ matrix of quantized activations.
$B$ is an $n \times p$ matrix of quantized weights.
Consider multiplying the $j$th row of $A,ドル $a_j,ドル by the $k$th column of $B,ドル $b_k,ドル both of length $n$. The quantized integer values and zero-point values are $q_a,ドル $z_a$ and $q_b,ドル $z_b$ respectively.

\[a_j \cdot b_k = \sum_{i=1}^{n} a_{j}^{(i)} b_{k}^{(i)} = \sum_{i=1}^{n} (q_{a}^{(i)} - z_a) (q_{b}^{(i)} - z_b) = \sum_{i=1}^{n} q_{a}^{(i)} q_{b}^{(i)} - \sum_{i=1}^{n} q_{a}^{(i)} z_b - \sum_{i=1}^{n} q_{b}^{(i)} z_a + \sum_{i=1}^{n} z_a z_b\]

The \(\sum_{i=1}^{n} q_{a}^{(i)} q_{b}^{(i)}\) term is unavoidable, since it performs the dot product of the input values and the weight values.

The \(\sum_{i=1}^{n} q_{b}^{(i)} z_a\) and \(\sum_{i=1}^{n} z_a z_b\) terms are made up of constants that remain the same for every inference invocation, and thus can be pre-calculated.

The \(\sum_{i=1}^{n} q_{a}^{(i)} z_b\) term needs to be computed on every inference, since the activations change with every inference. By enforcing that weights are symmetric (\(z_b = 0\)), we remove the cost of this term.
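
The decomposition is easy to check numerically. Here is a minimal sketch (variable names ours) that expands one dot product and marks which terms a backend could fold into a precomputed bias:

    import numpy as np

    rng = np.random.default_rng(0)
    n = 16
    q_a = rng.integers(-128, 128, n)   # quantized activations
    q_b = rng.integers(-127, 128, n)   # quantized weights
    z_a, z_b = 7, 3                    # zero-points (z_b = 0 once weights are symmetric)

    lhs = np.sum((q_a - z_a) * (q_b - z_b))
    rhs = (np.sum(q_a * q_b)       # unavoidable integer dot product
           - z_b * np.sum(q_a)     # recomputed every inference, unless z_b == 0
           - z_a * np.sum(q_b)     # constant: weights are fixed, so precompute
           + n * z_a * z_b)        # constant: precompute
    assert lhs == rhs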

int8 quantized operator specifications

Below we describe the quantization requirements for our int8 TFLite kernels:

ADD
  Input 0:
    data_type  : int8
    range      : [-128, 127]
    granularity: per-tensor
  Input 1:
    data_type  : int8
    range      : [-128, 127]
    granularity: per-tensor
  Output 0:
    data_type  : int8
    range      : [-128, 127]
    granularity: per-tensor

AVERAGE_POOL_2D
  Input 0:
    data_type  : int8
    range      : [-128, 127]
    granularity: per-tensor
  Output 0:
    data_type  : int8
    range      : [-128, 127]
    granularity: per-tensor
  restriction: Input and outputs must all have same scale/zero_point

CONCATENATION
  Input ...:
    data_type  : int8
    range      : [-128, 127]
    granularity: per-tensor
  Output 0:
    data_type  : int8
    range      : [-128, 127]
    granularity: per-tensor
  restriction: Input and outputs must all have same scale/zero_point

CONV_2D
  Input 0:
    data_type  : int8
    range      : [-128, 127]
    granularity: per-tensor
  Input 1 (Weight):
    data_type  : int8
    range      : [-127, 127]
    granularity: per-axis (dim = 0)
    restriction: zero_point = 0
  Input 2 (Bias):
    data_type  : int32
    range      : [int32_min, int32_max]
    granularity: per-axis
    restriction: (scale, zero_point) = (input0_scale * input1_scale[...], 0)
  Output 0:
    data_type  : int8
    range      : [-128, 127]
    granularity: per-tensor

DEPTHWISE_CONV_2D
  Input 0:
    data_type  : int8
    range      : [-128, 127]
    granularity: per-tensor
  Input 1 (Weight):
    data_type  : int8
    range      : [-127, 127]
    granularity: per-axis (dim = 3)
    restriction: zero_point = 0
  Input 2 (Bias):
    data_type  : int32
    range      : [int32_min, int32_max]
    granularity: per-axis
    restriction: (scale, zero_point) = (input0_scale * input1_scale[...], 0)
  Output 0:
    data_type  : int8
    range      : [-128, 127]
    granularity: per-tensor

FULLY_CONNECTED
  Input 0:
    data_type  : int8
    range      : [-128, 127]
    granularity: per-tensor
  Input 1 (Weight):
    data_type  : int8
    range      : [-127, 127]
    granularity: per-axis (dim = 0)
    restriction: zero_point = 0
  Input 2 (Bias):
    data_type  : int32
    range      : [int32_min, int32_max]
    granularity: per-tensor
    restriction: (scale, zero_point) = (input0_scale * input1_scale[...], 0)
  Output 0:
    data_type  : int8
    range      : [-128, 127]
    granularity: per-tensor

L2_NORMALIZATION
  Input 0:
    data_type  : int8
    range      : [-128, 127]
    granularity: per-tensor
  Output 0:
    data_type  : int8
    range      : [-128, 127]
    granularity: per-tensor
    restriction: (scale, zero_point) = (1.0 / 128.0, 0)

LOGISTIC
  Input 0:
    data_type  : int8
    range      : [-128, 127]
    granularity: per-tensor
  Output 0:
    data_type  : int8
    range      : [-128, 127]
    granularity: per-tensor
    restriction: (scale, zero_point) = (1.0 / 256.0, -128)

MAX_POOL_2D
  Input 0:
    data_type  : int8
    range      : [-128, 127]
    granularity: per-tensor
  Output 0:
    data_type  : int8
    range      : [-128, 127]
    granularity: per-tensor
  restriction: Input and outputs must all have same scale/zero_point

MUL
  Input 0:
    data_type  : int8
    range      : [-128, 127]
    granularity: per-tensor
  Input 1:
    data_type  : int8
    range      : [-128, 127]
    granularity: per-tensor
  Output 0:
    data_type  : int8
    range      : [-128, 127]
    granularity: per-tensor

RESHAPE
  Input 0:
    data_type  : int8
    range      : [-128, 127]
    granularity: per-tensor
  Output 0:
    data_type  : int8
    range      : [-128, 127]
    granularity: per-tensor
  restriction: Input and outputs must all have same scale/zero_point

RESIZE_BILINEAR
  Input 0:
    data_type  : int8
    range      : [-128, 127]
    granularity: per-tensor
  Output 0:
    data_type  : int8
    range      : [-128, 127]
    granularity: per-tensor
  restriction: Input and outputs must all have same scale/zero_point

SOFTMAX
  Input 0:
    data_type  : int8
    range      : [-128, 127]
    granularity: per-tensor
  Output 0:
    data_type  : int8
    range      : [-128, 127]
    granularity: per-tensor
    restriction: (scale, zero_point) = (1.0 / 256.0, -128)

SPACE_TO_DEPTH
  Input 0:
    data_type  : int8
    range      : [-128, 127]
    granularity: per-tensor
  Output 0:
    data_type  : int8
    range      : [-128, 127]
    granularity: per-tensor
  restriction: Input and outputs must all have same scale/zero_point

TANH
  Input 0:
    data_type  : int8
    range      : [-128, 127]
    granularity: per-tensor
  Output 0:
    data_type  : int8
    range      : [-128, 127]
    granularity: per-tensor
    restriction: (scale, zero_point) = (1.0 / 128.0, 0)

PAD
  Input 0:
    data_type  : int8
    range      : [-128, 127]
    granularity: per-tensor
  Output 0:
    data_type  : int8
    range      : [-128, 127]
    granularity: per-tensor
  restriction: Input and outputs must all have same scale/zero_point

GATHER
  Input 0:
    data_type  : int8
    range      : [-128, 127]
    granularity: per-tensor
  Output 0:
    data_type  : int8
    range      : [-128, 127]
    granularity: per-tensor
  restriction: Input and outputs must all have same scale/zero_point

BATCH_TO_SPACE_ND
  Input 0:
    data_type  : int8
    range      : [-128, 127]
    granularity: per-tensor
  Output 0:
    data_type  : int8
    range      : [-128, 127]
    granularity: per-tensor
  restriction: Input and outputs must all have same scale/zero_point

SPACE_TO_BATCH_ND
  Input 0:
    data_type  : int8
    range      : [-128, 127]
    granularity: per-tensor
  Output 0:
    data_type  : int8
    range      : [-128, 127]
    granularity: per-tensor
  restriction: Input and outputs must all have same scale/zero_point

TRANSPOSE
  Input 0:
    data_type  : int8
    range      : [-128, 127]
    granularity: per-tensor
  Output 0:
    data_type  : int8
    range      : [-128, 127]
    granularity: per-tensor
  restriction: Input and outputs must all have same scale/zero_point

MEAN
  Input 0:
    data_type  : int8
    range      : [-128, 127]
    granularity: per-tensor
  Output 0:
    data_type  : int8
    range      : [-128, 127]
    granularity: per-tensor

SUB
  Input 0:
    data_type  : int8
    range      : [-128, 127]
    granularity: per-tensor
  Input 1:
    data_type  : int8
    range      : [-128, 127]
    granularity: per-tensor
  Output 0:
    data_type  : int8
    range      : [-128, 127]
    granularity: per-tensor

SUM
  Input 0:
    data_type  : int8
    range      : [-128, 127]
    granularity: per-tensor
  Output 0:
    data_type  : int8
    range      : [-128, 127]
    granularity: per-tensor

SQUEEZE
  Input 0:
    data_type  : int8
    range      : [-128, 127]
    granularity: per-tensor
  Output 0:
    data_type  : int8
    range      : [-128, 127]
    granularity: per-tensor
  restriction: Input and outputs must all have same scale/zero_point

LOG_SOFTMAX
  Input 0:
    data_type  : int8
    range      : [-128, 127]
    granularity: per-tensor
  Output 0:
    data_type  : int8
    range      : [-128, 127]
    granularity: per-tensor
    restriction: (scale, zero_point) = (16.0 / 256.0, 127)

MAXIMUM
  Input 0:
    data_type  : int8
    range      : [-128, 127]
    granularity: per-tensor
  Output 0:
    data_type  : int8
    range      : [-128, 127]
    granularity: per-tensor
  restriction: Input and outputs must all have same scale/zero_point

ARG_MAX
  Input 0:
    data_type  : int8
    range      : [-128, 127]
    granularity: per-tensor

MINIMUM
  Input 0:
    data_type  : int8
    range      : [-128, 127]
    granularity: per-tensor
  Output 0:
    data_type  : int8
    range      : [-128, 127]
    granularity: per-tensor
  restriction: Input and outputs must all have same scale/zero_point

LESS
  Input 0:
    data_type  : int8
    range      : [-128, 127]
    granularity: per-tensor
  Input 1:
    data_type  : int8
    range      : [-128, 127]
    granularity: per-tensor

PADV2
  Input 0:
    data_type  : int8
    range      : [-128, 127]
    granularity: per-tensor
  Output 0:
    data_type  : int8
    range      : [-128, 127]
    granularity: per-tensor
  restriction: Input and outputs must all have same scale/zero_point

GREATER
  Input 0:
    data_type  : int8
    range      : [-128, 127]
    granularity: per-tensor
  Input 1:
    data_type  : int8
    range      : [-128, 127]
    granularity: per-tensor

GREATER_EQUAL
  Input 0:
    data_type  : int8
    range      : [-128, 127]
    granularity: per-tensor
  Input 1:
    data_type  : int8
    range      : [-128, 127]
    granularity: per-tensor

LESS_EQUAL
  Input 0:
    data_type  : int8
    range      : [-128, 127]
    granularity: per-tensor
  Input 1:
    data_type  : int8
    range      : [-128, 127]
    granularity: per-tensor

SLICE
  Input 0:
    data_type  : int8
    range      : [-128, 127]
    granularity: per-tensor
  Output 0:
    data_type  : int8
    range      : [-128, 127]
    granularity: per-tensor
  restriction: Input and outputs must all have same scale/zero_point

EQUAL
  Input 0:
    data_type  : int8
    range      : [-128, 127]
    granularity: per-tensor
  Input 1:
    data_type  : int8
    range      : [-128, 127]
    granularity: per-tensor

NOT_EQUAL
  Input 0:
    data_type  : int8
    range      : [-128, 127]
    granularity: per-tensor
  Input 1:
    data_type  : int8
    range      : [-128, 127]
    granularity: per-tensor

SHAPE
  Input 0:
    data_type  : int8
    range      : [-128, 127]
    granularity: per-tensor

QUANTIZE (Requantization)
  Input 0:
    data_type  : int8
    range      : [-128, 127]
    granularity: per-tensor
  Output 0:
    data_type  : int8
    range      : [-128, 127]
    granularity: per-tensor
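
As an example of how a backend might enforce these requirements, here is a hypothetical checker for the CONV_2D entry above; the parameter layout and function name are assumptions, not a LiteRT API:

    import numpy as np

    def check_conv2d_quant_params(input_scale, weight_scales, weight_zero_points,
                                  bias_scales, bias_zero_points):
        # Hypothetical validation of the CONV_2D restrictions listed above.
        weight_scales = np.asarray(weight_scales, dtype=np.float64)
        # Weights must be symmetric per-axis: zero_point = 0.
        assert np.all(np.asarray(weight_zero_points) == 0), "weight zero_point must be 0"
        # Bias: zero_point = 0 and scale = input0_scale * input1_scale[...] per axis.
        assert np.all(np.asarray(bias_zero_points) == 0), "bias zero_point must be 0"
        assert np.allclose(bias_scales, input_scale * weight_scales), \
            "bias scale must equal input_scale * weight_scale per output channel"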

References

Benoit Jacob et al., "Quantization and Training of Neural Networks for Efficient Integer-Arithmetic-Only Inference", arXiv:1712.05877.
