How to quantize inputs and outputs of an optimized TFLite model

Question:

I use the following code to generate a quantized TFLite model:

import tensorflow as tf

def representative_dataset_gen():
  for _ in range(num_calibration_steps):
    # Get sample input data as a numpy array in a method of your choosing.
    yield [input]

converter = tf.lite.TFLiteConverter.from_saved_model(saved_model_dir)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.representative_dataset = representative_dataset_gen
tflite_quant_model = converter.convert()

But according to the post-training quantization documentation:

The resulting model will be fully quantized but still take float input and output for convenience.

To compile the TFLite model for the Google Coral Edge TPU, I need quantized input and output as well.

In the model, I see that the first network layer converts float input to input_uint8 and the last layer converts output_uint8 to the float output.
How do I edit the TFLite model to get rid of the first and last float layers?

I know that I could set the input and output type to uint8 during conversion, but this is not compatible with any optimizations. The only available option then is to use fake quantization, which results in a bad model.
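
The float boundaries described above can be confirmed by loading the converted model into the TFLite interpreter and listing its tensors (a minimal sketch, assuming tflite_quant_model is the buffer produced by the conversion code above):

import tensorflow as tf

# Load the converted flatbuffer and walk through its tensors.
interpreter = tf.lite.Interpreter(model_content=tflite_quant_model)
interpreter.allocate_tensors()

# The input and output tensors show up as float32; the quantized tensors sit in between.
for detail in interpreter.get_tensor_details():
    print(detail['index'], detail['name'], detail['dtype'])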

Asked By: Ivan Kovtun


Answers:

You can avoid the float-to-int8 and int8-to-float "quant/dequant" ops by setting inference_input_type and inference_output_type (https://github.com/tensorflow/tensorflow/blob/master/tensorflow/lite/python/lite.py#L460-L476) to int8.

Answered By: J.L.

converter = tf.lite.TFLiteConverter.from_saved_model(saved_model_dir)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.representative_dataset = representative_dataset
# The three lines below quantize the input and output tensors as well
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
converter.inference_input_type = tf.uint8
converter.inference_output_type = tf.uint8
tflite_model = converter.convert()
Answered By: Sandeep Vivek

This:

def representative_data_gen():
  for input_value in tf.data.Dataset.from_tensor_slices(train_images).batch(1).take(100):
    # Model has only one input so each data point has one element.
    yield [input_value]

converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.representative_dataset = representative_data_gen

tflite_model_quant = converter.convert()

generates a model whose input and output tensors are still float32. This:

def representative_data_gen():
  for input_value in tf.data.Dataset.from_tensor_slices(train_images).batch(1).take(100):
    yield [input_value]

converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.representative_dataset = representative_data_gen
# Ensure that if any ops can't be quantized, the converter throws an error
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
# Set the input and output tensors to uint8 (APIs added in r2.3)
converter.inference_input_type = tf.uint8
converter.inference_output_type = tf.uint8

tflite_model_quant = converter.convert()

generates a fully quantized model with uint8 inputs and outputs.

You can make sure this is the case by running:

interpreter = tf.lite.Interpreter(model_content=tflite_model_quant)
input_type = interpreter.get_input_details()[0]['dtype']
print('input: ', input_type)
output_type = interpreter.get_output_details()[0]['dtype']
print('output: ', output_type)

which returns:

input:  <class 'numpy.uint8'>
output:  <class 'numpy.uint8'>

if you went for full uint8 quantization. You can double-check this by inspecting the model visually using Netron.
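
For example, a minimal sketch (assuming the converted buffer is written to disk as model_quant.tflite, a hypothetical file name, and the netron Python package is installed):

import netron

# Write the converted buffer to disk, then open it in Netron's browser-based viewer.
with open('model_quant.tflite', 'wb') as f:
    f.write(tflite_model_quant)
netron.start('model_quant.tflite')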

Answered By: Mike B
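
Once the converted model takes uint8 inputs and outputs, the caller is responsible for quantizing and dequantizing the data at the boundaries. A minimal sketch, assuming tflite_model_quant is the uint8 model produced above (the random sample stands in for real input data):

import numpy as np
import tensorflow as tf

interpreter = tf.lite.Interpreter(model_content=tflite_model_quant)
interpreter.allocate_tensors()

input_details = interpreter.get_input_details()[0]
output_details = interpreter.get_output_details()[0]

# Quantize a float32 sample to uint8 using the input tensor's scale and zero point.
scale, zero_point = input_details['quantization']
sample = np.random.rand(*input_details['shape']).astype(np.float32)  # placeholder input
quantized = np.clip(np.round(sample / scale + zero_point), 0, 255).astype(np.uint8)

interpreter.set_tensor(input_details['index'], quantized)
interpreter.invoke()

# Dequantize the uint8 output back to float32 for interpretation.
out_scale, out_zero_point = output_details['quantization']
raw_output = interpreter.get_tensor(output_details['index'])
float_output = (raw_output.astype(np.float32) - out_zero_point) * out_scale
print(float_output)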