How to use all GPUs in SageMaker real-time inference?

Question:

I have deployed a model for real-time inference on a single-GPU instance, and it works fine.

Now I want to use multiple GPUs to decrease the inference time. What do I need to change in my inference.py to make it work?

Here is some of my code:

DEVICE = "cuda" if torch.cuda.is_available() else "cpu"
def model_fn(model_dir):
    logger.info("Loading first model...")
    model = Model().to(DEVICE)
    with open(os.path.join(model_dir, "checkpoint.pth"), "rb") as f:
        model.load_state_dict(torch.load(f, map_location=DEVICE)['state_dict'])
    model = model.eval()
    
    logger.info("Loading second model...")
    model_2 = Model_2()
    model_2.to(DEVICE)
    checkpoint = torch.load('checkpoint_2.pth', map_location=DEVICE)
    model_2.load_state_dict(remove_prefix_state_dict(checkpoint['state_dict']), strict=True)
    model_2 = model_2.eval()
    
    logger.info('Done loading models')
    return {'first_model': model, 'second_model': model_2}

def input_fn(request_body, request_content_type):
    assert request_content_type=='application/json'
    request = json.loads(request_body)
    url = request['url']
    save_name = request['save_name']
    logger.info(f'Image url: {url}')
    img = Image.open(requests.get(url, stream=True).raw).convert('RGB')
    w, h = img.size
    input_tensor = preprocess(img)
    input_batch = input_tensor.unsqueeze(0).to(DEVICE)
    logger.info('Image ready to predict!')
    return {'tensor':input_batch, 'w':w,'h':h,'image':img, 'save_name':save_name}

def predict_fn(input_object, model):
    data = input_object['tensor']
    logger.info('Generating prediction based on the input image')
    model_1 = model['first_model']
    model_2 = model['second_model']
    d0, d1, d2, d3, d4, d5, d6 = model_1(data)
    torch.cuda.empty_cache()
    mask = torch.argmax(d0[0], axis=0).cpu().numpy()
    mask = np.where(mask==2, 255, mask)
    mask = np.where(mask==1, 128, mask)
    img = input_object['image']
    final_image = Image.fromarray(mask).resize((input_object['w'], input_object['h'])).convert('L')
    img = np.array(img)[:,:,::-1]
    final_image = np.array(final_image)
    image_dict = to_dict(img, final_image)
    final_image = model_2_process(model_2, image_dict)
    torch.cuda.empty_cache()
    
    return {"final_ouput": final_image, 'image':input_object['image'], 'save_name': input_object['save_name']}

I was thinking that maybe torch multiprocessing could help; any tips?

Asked By: Diego Rodea


Answers:

You must use torch.nn.DataParallel or torch.nn.parallel.DistributedDataParallel (read "Multi-GPU Examples" and "Use nn.parallel.DistributedDataParallel instead of multiprocessing or nn.DataParallel").

You construct the wrapper by passing at least these three parameters:

module (Module) – module to be parallelized (your model)

device_ids (list of int or torch.device) – CUDA devices.

  1. For single-device modules, device_ids can contain exactly one device id, which represents the only CUDA device where the input module corresponding to this process resides. Alternatively, device_ids can also be None.
  2. For multi-device modules and CPU modules, device_ids must be None.

When device_ids is None in both cases, both the input data for the forward pass and the actual module must be placed on the correct device. (default: None)

output_device (int or torch.device) – Device location of output for single-device CUDA modules. For multi-device modules and CPU modules, it must be None, and the module itself dictates the output location. (default: device_ids[0] for single-device modules)

for example:

from torch.nn.parallel import DistributedDataParallel
# i is the GPU index (local rank) for this process; torch.distributed.init_process_group must be called first
model = DistributedDataParallel(model, device_ids=[i], output_device=i)
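For completeness, the simpler torch.nn.DataParallel option mentioned above is a one-line wrapper. A minimal sketch, reusing model and a batched input from the question (note that DataParallel splits the batch along dimension 0 across the GPUs, so it only helps when the batch size is greater than one):

import torch

# Replicates the module on every visible GPU, scatters the input batch along
# dim 0, and gathers the outputs back on the default device.
model = torch.nn.DataParallel(model.to("cuda"))
output = model(input_batch.to("cuda"))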
Answered By: Giuseppe La Gualano

The answer mentioning Torch DDP and DP is not really appropriate here, since the value of those libraries is to conduct multi-GPU gradient descent (in particular, averaging gradients across GPUs), which does not happen at inference. Actually, a well-optimized inference stack ideally doesn't use PyTorch or TensorFlow at all, but instead a prediction-only optimized runtime such as SageMaker Neo, ONNX Runtime or NVIDIA TensorRT, to reduce memory footprint and latency.
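As a rough illustration of the runtime point, here is a sketch of exporting the first model to ONNX and running it with ONNX Runtime (the 1x3x320x320 input shape and file name are assumptions, not your actual pipeline):

import torch
import onnxruntime as ort

# Export the question's first model to ONNX (dummy shape is an assumption;
# match it to your preprocessing).
dummy = torch.randn(1, 3, 320, 320)
torch.onnx.export(model_1.cpu().eval(), dummy, "model_1.onnx",
                  input_names=["input"], opset_version=13)

# Run it with ONNX Runtime, preferring the CUDA execution provider when available.
session = ort.InferenceSession("model_1.onnx",
                               providers=["CUDAExecutionProvider", "CPUExecutionProvider"])
outputs = session.run(None, {"input": dummy.numpy()})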

To serve a single model that fits on one GPU, multi-GPU instances are generally not advised: inference is a share-nothing task, so you can use N single-GPU instances, which is simpler and equally performant.
Inference on a multi-GPU host is useful in two cases: (1) if you do model-parallel inference (not your case), or (2) if your inference service consists of a graph of models calling each other. In that case, the proximity of the various models in the DAG can reduce latency. That seems to be your situation.

My recommendations are the following:

  1. Try NVIDIA Triton Inference Server, which supports those DAG use cases well and is supported on SageMaker. https://aws.amazon.com/fr/blogs/machine-learning/deploy-fast-and-scalable-ai-with-nvidia-triton-inference-server-in-amazon-sagemaker/

  2. If you want to do things custom, you could try assigning the two models to different CUDA device ids in PyTorch (see the sketch after this list). Because CUDA kernels run asynchronously, this can be enough to get some parallelism and a bit of acceleration versus a single GPU, if your models can run in parallel.
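
For option 2, a minimal sketch, not your exact code: it assumes a multi-GPU instance and reuses the Model and Model_2 classes from the question, with checkpoint loading omitted.

import torch

# Assumption: at least two visible GPUs; fall back to GPU 0 otherwise.
DEVICE_1 = torch.device("cuda:0")
DEVICE_2 = torch.device("cuda:1" if torch.cuda.device_count() > 1 else "cuda:0")

def model_fn(model_dir):
    # Pin each model to its own GPU so their kernels can overlap.
    model = Model().to(DEVICE_1).eval()
    model_2 = Model_2().to(DEVICE_2).eval()
    return {'first_model': model, 'second_model': model_2}

# In predict_fn, send the input tensor to DEVICE_1 for the first model, then move the
# intermediate result to DEVICE_2 before the second model consumes it, e.g.:
#   data = input_object['tensor'].to(DEVICE_1)
#   intermediate = intermediate.to(DEVICE_2)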

I saw multiprocessing used once (with MXNet) to load-balance inference requests across GPUs (in this AWS blog post), but that was a share-nothing, map-style distribution of batches of inferences. In your case the models are connected to each other, so Triton is probably a better fit.

Finally, if your goal is to reduce latency, there are other ideas:

  1. Fix any CPU bottleneck. Your code seems to have a lot of CPU work (pre-processing, NumPy…). Are you sure the GPU is the bottleneck? If CPU usage is at 80%+, try a large single-GPU G5 instance, such as g5.16xlarge. They are great for computer vision inference.

  2. Use a better GPU. If you are using a P2, P3, or G4dn instance, try G5 instead.

  3. Optimize the code. Two things to try, depending on the bottleneck:

    1. If you do the inference in Torch, try to avoid doing algebra with NumPy, and do as much as possible with torch tensors on the GPU (a sketch follows this list).
    2. If the GPU is the bottleneck, try replacing PyTorch with ONNX Runtime or NVIDIA TensorRT.
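
For point 3.1, a sketch of the question's mask post-processing done with torch ops on the GPU instead of NumPy (it reuses d0 from predict_fn; only the final result is copied to the CPU):

import torch

# Same mask logic as the question, but computed on the GPU with a single
# device-to-host transfer at the end.
with torch.no_grad():
    mask = torch.argmax(d0[0], dim=0)
    mask = torch.where(mask == 2, torch.tensor(255, device=mask.device), mask)
    mask = torch.where(mask == 1, torch.tensor(128, device=mask.device), mask)
    mask = mask.to(torch.uint8).cpu().numpy()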
Answered By: Olivier Cruchant