The expanded size of the tensor (1011) must match the existing size (512) at non-singleton dimension 1

Question:

I have trained a LayoutLMv2 model from Hugging Face, and when I try to run inference on a single image it raises the runtime error above. The code for this is below:

query = '/Users/vaihabsaxena/Desktop/Newfolder/labeled/Others/Two.pdf26.png'
image = Image.open(query).convert("RGB")
encoded_inputs = processor(image, return_tensors="pt").to(device)
outputs = model(**encoded_inputs)
preds = torch.softmax(outputs.logits, dim=1).tolist()[0]
pred_labels = {label:pred for label, pred in zip(label2idx.keys(), preds)}
pred_labels

The error occurs when I call model(**encoded_inputs). The processor comes directly from Hugging Face and is initialized as follows, along with the other APIs:

feature_extractor = LayoutLMv2FeatureExtractor()
tokenizer = LayoutLMv2Tokenizer.from_pretrained("microsoft/layoutlmv2-base-uncased")
processor = LayoutLMv2Processor(feature_extractor, tokenizer)

The model is defined and trained as follows:

model = LayoutLMv2ForSequenceClassification.from_pretrained(
    "microsoft/layoutlmv2-base-uncased",  num_labels=len(label2idx)
)
model.to(device);


optimizer = AdamW(model.parameters(), lr=5e-5)
num_epochs = 3


for epoch in range(num_epochs):
    print("Epoch:", epoch)
    training_loss = 0.0
    training_correct = 0
    #put the model in training mode
    model.train()
    for batch in tqdm(train_dataloader):
        outputs = model(**batch)
        loss = outputs.loss

        training_loss += loss.item()
        predictions = outputs.logits.argmax(-1)
        training_correct += (predictions == batch['labels']).float().sum()

        loss.backward()
        optimizer.step()
        optimizer.zero_grad()

    print("Training Loss:", training_loss / batch["input_ids"].shape[0])
    training_accuracy = 100 * training_correct / len(train_data)
    print("Training accuracy:", training_accuracy.item())  
        
    validation_loss = 0.0
    validation_correct = 0
    for batch in tqdm(valid_dataloader):
        outputs = model(**batch)
        loss = outputs.loss

        validation_loss += loss.item()
        predictions = outputs.logits.argmax(-1)
        validation_correct += (predictions == batch['labels']).float().sum()

    print("Validation Loss:", validation_loss / batch["input_ids"].shape[0])
    validation_accuracy = 100 * validation_correct / len(valid_data)
    print("Validation accuracy:", validation_accuracy.item())

The complete error trace:

RuntimeError                              Traceback (most recent call last)
/Users/vaihabsaxena/Desktop/Newfolder/pytorch.ipynb Cell 37 in <cell line: 4>()
      2 image = Image.open(query).convert("RGB")
      3 encoded_inputs = processor(image, return_tensors="pt").to(device)
----> 4 outputs = model(**encoded_inputs)
      5 preds = torch.softmax(outputs.logits, dim=1).tolist()[0]
      6 pred_labels = {label:pred for label, pred in zip(label2idx.keys(), preds)}

File ~/opt/anaconda3/envs/env_pytorch/lib/python3.9/site-packages/torch/nn/modules/module.py:1130, in Module._call_impl(self, *input, **kwargs)
   1126 # If we don't have any hooks, we want to skip the rest of the logic in
   1127 # this function, and just call forward.
   1128 if not (self._backward_hooks or self._forward_hooks or self._forward_pre_hooks or _global_backward_hooks
   1129         or _global_forward_hooks or _global_forward_pre_hooks):
-> 1130     return forward_call(*input, **kwargs)
   1131 # Do not call functions when jit is used
   1132 full_backward_hooks, non_full_backward_hooks = [], []

File ~/opt/anaconda3/envs/env_pytorch/lib/python3.9/site-packages/transformers/models/layoutlmv2/modeling_layoutlmv2.py:1071, in LayoutLMv2ForSequenceClassification.forward(self, input_ids, bbox, image, attention_mask, token_type_ids, position_ids, head_mask, inputs_embeds, labels, output_attentions, output_hidden_states, return_dict)
   1061 visual_position_ids = torch.arange(0, visual_shape[1], dtype=torch.long, device=device).repeat(
   1062     input_shape[0], 1
   1063 )
   1065 initial_image_embeddings = self.layoutlmv2._calc_img_embeddings(
   1066     image=image,
   1067     bbox=visual_bbox,
...
    896     input_shape[0], 1
    897 )
    898 final_position_ids = torch.cat([position_ids, visual_position_ids], dim=1)

RuntimeError: The expanded size of the tensor (1011) must match the existing size (512) at non-singleton dimension 1.  Target sizes: [1, 1011].  Tensor sizes: [1, 512]

I have tried setting up the tokenizer to cut off at the max length, but then encoded_inputs comes back as NoneType even though the image is still there. What is going wrong here?

Asked By: Vai


Answers:

The error message tells you that the text extracted via OCR is longer (1011 tokens) than the underlying text model can handle (512 tokens). Depending on your task, you may be able to truncate the text with the tokenizer parameter truncation (the processor passes this parameter on to the tokenizer):

import torch
from transformers import LayoutLMv2FeatureExtractor, LayoutLMv2Tokenizer, LayoutLMv2Processor, LayoutLMv2ForSequenceClassification
from PIL import Image, ImageDraw, ImageFont

query = "/content/Screenshot_20220905_202551.png"
image = Image.open(query).convert("RGB")

feature_extractor = LayoutLMv2FeatureExtractor()
tokenizer = LayoutLMv2Tokenizer.from_pretrained("microsoft/layoutlmv2-base-uncased")
processor = LayoutLMv2Processor(feature_extractor, tokenizer)
model = LayoutLMv2ForSequenceClassification.from_pretrained("microsoft/layoutlmv2-base-uncased",  num_labels=2)

encoded_inputs = processor(image, return_tensors="pt")
# Without truncation the model would raise an error: the sequence is longer than the trained position embeddings (512)
print(encoded_inputs["input_ids"].shape)
encoded_inputs = processor(image, return_tensors="pt", truncation=True)
print(encoded_inputs["input_ids"].shape)
outputs = model(**encoded_inputs)
preds = torch.softmax(outputs.logits, dim=1).tolist()[0]

Output:

torch.Size([1, 644])
torch.Size([1, 512])

For this code, I used a screenshot of the Donut paper as the input image.
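
Applied back to the question's own setup, the fix is the same single argument. The following is a minimal sketch, assuming processor, model, device, label2idx and query are defined as in the question; only truncation=True (and an optional torch.no_grad() context) is new:

import torch
from PIL import Image

# Sketch of the asker's inference snippet with truncation enabled.
# Assumes `processor`, `model`, `device`, `label2idx` and `query` exist as in the question.
image = Image.open(query).convert("RGB")
encoded_inputs = processor(image, return_tensors="pt", truncation=True).to(device)

with torch.no_grad():  # gradients are not needed for inference
    outputs = model(**encoded_inputs)

preds = torch.softmax(outputs.logits, dim=1).tolist()[0]
pred_labels = {label: pred for label, pred in zip(label2idx.keys(), preds)}
print(pred_labels)

The only functional difference from the question's code is the truncation=True argument; everything else is unchanged.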

Answered By: cronoik