Transformers: How to use CUDA for inference?

Question:

I have fine-tuned my model on the GPU, but the inference process is very slow. I think this is because inference uses the CPU by default. Here is my inference code:

txt = "This was nice place"
model = transformers.BertForSequenceClassification.from_pretrained(model_path, num_labels=24)
tokenizer = transformers.BertTokenizer.from_pretrained('TurkuNLP/bert-base-finnish-cased-v1')
encoding = tokenizer.encode_plus(txt, add_special_tokens = True, truncation = True, padding = "max_length", return_attention_mask = True, return_tensors = "pt")
output = model(**encoding)
output = output.logits.softmax(dim=-1).detach().cpu().flatten().numpy().tolist()

Here is my second inference snippet, which uses a pipeline (for a different model):

classifier = transformers.pipeline("sentiment-analysis", model="distilbert-base-uncased-finetuned-sst-2-english")
result = classifier(txt)

How can I force the transformers library to run faster inference on the GPU? I have tried adding model.to(torch.device("cuda")), but that throws an error:

Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu

I suppose the problem is related to the data not being sent to the GPU. There is a similar issue here: pytorch summary fails with huggingface model II: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu

How would I send the data to the GPU with and without a pipeline? Any advice is highly appreciated.

Asked By: Mr. Engineer


Answers:

You should transfer your input to CUDA as well before performing inference:

import torch

device = torch.device('cuda')

# transfer the model to the GPU (in-place for nn.Module)
model.to(device)

# define the input and transfer the encoding to the same device
encoding = tokenizer.encode_plus(txt,
     add_special_tokens=True,
     truncation=True,
     padding="max_length",
     return_attention_mask=True,
     return_tensors="pt")

encoding = encoding.to(device)

# inference
output = model(**encoding)
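
If you then post-process the logits as in the question, the .cpu() call in your original snippet is what moves the result back to the host before converting to numpy (shown here only as a sketch mirroring your own code, with probs used as an illustrative name):

probs = output.logits.softmax(dim=-1).detach().cpu().flatten().numpy().tolist()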

Be aware that nn.Module.to operates in-place, while torch.Tensor.to does not (it returns a copy!).
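
A minimal sketch of that difference (the model and the x tensor here are just placeholders for illustration):

import torch
import torch.nn as nn

model = nn.Linear(4, 2)
x = torch.randn(1, 4)

model.to('cuda')   # nn.Module.to moves the parameters in-place; no reassignment needed
x.to('cuda')       # torch.Tensor.to returns a new tensor; x itself stays on the CPU
x = x.to('cuda')   # reassign to actually use the CUDA copy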

Answered By: Ivan

For the pipeline code question

The problem is that the default behavior of transformers.pipeline is to use the CPU. You can add the device parameter to select a GPU instead, for example:

  • device=0 to utilize GPU cuda:0
  • device=1 to utilize GPU cuda:1

pipeline = pipeline(TASK, model=MODEL_PATH, device=0)

Your code becomes:

classifier = transformers.pipeline("sentiment-analysis", model="distilbert-base-uncased-finetuned-sst-2-english", device=0)
result = classifier(txt)
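
If the code also needs to run on machines without a GPU, one common pattern (not part of the original answer, just a sketch) is to pick the device at runtime; for integer device ids, -1 selects the CPU:

import torch
import transformers

device = 0 if torch.cuda.is_available() else -1  # 0 = first GPU, -1 = CPU
classifier = transformers.pipeline("sentiment-analysis",
                                   model="distilbert-base-uncased-finetuned-sst-2-english",
                                   device=device)
result = classifier(txt)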

Answered By: imbr