TorchServe: How to convert bytes output to tensors

Question:

I have a model that is served using TorchServe, and I’m communicating with the TorchServe server using gRPC. The final postprocess method of the custom handler returns a list, which is converted into bytes for transfer over the network.

The postprocess method:

def postprocess(self, data):
    # data type - torch.Tensor
    # data shape - [1, 17, 80, 64] and data dtype - torch.float32
    return data.tolist()

The main issue is on the client side, where converting the received bytes from TorchServe into a torch.Tensor is done inefficiently via ast.literal_eval:

import torch
from ast import literal_eval

# This takes 0.3 seconds
response = self.inference_stub.Predictions(
    inference_pb2.PredictionsRequest(model_name=model_name, input=input_data))
# This takes 0.84 seconds
predictions = torch.as_tensor(literal_eval(
    response.prediction.decode('utf-8')))

Using numpy.frombuffer or torch.frombuffer returns the following errors.

import numpy as np

np.frombuffer(response.prediction)
Traceback (most recent call last):
  File "<string>", line 1, in <module>
ValueError: buffer size must be a multiple of element size

np.frombuffer(response.prediction, dtype=np.float32)
Traceback (most recent call last):
  File "<string>", line 1, in <module>
ValueError: buffer size must be a multiple of element size

Using torch:

import torch
torch.frombuffer(response.prediction, dtype=torch.float32)
Traceback (most recent call last):
  File "<string>", line 1, in <module>
ValueError: buffer length (2601542 bytes) after offset (0 bytes) must be a multiple of element size (4)
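
The error itself points at the cause: response.prediction here is not raw float32 data but the UTF-8 text of the nested Python list produced by data.tolist(), which is why its length (2,601,542 bytes) is not a multiple of the 4-byte element size. frombuffer only applies to raw tensor bytes. A minimal sketch, assuming a hypothetical handler that instead returned raw bytes via data.cpu().numpy().tobytes() (raw bytes carry no shape information, so the client has to know it):

import numpy as np
import torch

# Assumes the handler returned data.cpu().numpy().tobytes(), so
# response.prediction holds raw little-endian float32 values (hypothetical).
raw = response.prediction
array = np.frombuffer(raw, dtype=np.float32).reshape(1, 17, 80, 64)
preds = torch.from_numpy(array.copy())  # copy(): frombuffer yields a read-only view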

Is there an alternative, more efficient solution for converting the received bytes into a torch.Tensor?

Asked By: Mohit Motwani


Answers:

One hack I’ve found that significantly increases performance when sending large tensors is to return the data as a JSON-serializable dict inside a list.

In your handler’s postprocess function:

def postprocess(self, data):
    output_data = {}
    output_data['data'] = data.tolist()
    return [output_data]

On the client side, when you receive the gRPC response, decode it using json.loads:

import json
import torch

response = self.inference_stub.Predictions(
    inference_pb2.PredictionsRequest(model_name=model_name, input=input_data))
decoded_output = response.prediction.decode('utf-8')
preds = torch.as_tensor(json.loads(decoded_output))

preds should now hold the output tensor.

Update:

There’s an even faster method that should completely eliminate the bottleneck. Use tf.io.serialize_tensor from TensorFlow to serialize your tensor inside postprocess:

import tensorflow as tf

def postprocess(self, data):
    return [tf.io.serialize_tensor(data.cpu()).numpy()]

On the client, decode it using tf.io.parse_tensor:

import tensorflow as tf
import torch

response = self.inference_stub.Predictions(
    inference_pb2.PredictionsRequest(model_name=model_name, input=input_data))
prediction = response.prediction
preds = torch.as_tensor(tf.io.parse_tensor(prediction, out_type=tf.float32).numpy())
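
A TensorFlow-free variant of the same idea should also work: serialize with torch.save into an in-memory buffer inside postprocess and rebuild with torch.load on the client. This is only a sketch under that assumption (it has not been timed against tf.io.serialize_tensor):

import io
import torch

# Handler side (sketch): serialize the tensor into an in-memory buffer
def postprocess(self, data):
    buffer = io.BytesIO()
    torch.save(data.cpu(), buffer)
    return [buffer.getvalue()]

# Client side (sketch): rebuild the tensor directly from the received bytes
response = self.inference_stub.Predictions(
    inference_pb2.PredictionsRequest(model_name=model_name, input=input_data))
preds = torch.load(io.BytesIO(response.prediction))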
Answered By: Mohit Motwani