BertTokenizer – when encoding and decoding sequences extra spaces appear

Question:

When using Transformers from HuggingFace, I am facing a problem with the encoding and decoding methods.

I have the following string:

test_string = 'text with percentage%'

Then I am running the following code:

import torch
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-cased')

test_string = 'text with percentage%'

# encode() converts a string into a sequence of ids (integers), using the tokenizer and vocabulary.
input_ids = tokenizer.encode(test_string)
output = tokenizer.decode(input_ids)

And the output looks like this:

'text with percentage %'

There is an extra space before the %. I have tried extra arguments such as clean_up_tokenization_spaces, but that is for something different.

What should I use in the encoding and decoding so that I get exactly the same text before and after? This also happens for other special characters.
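For context, the extra space comes from the tokenization itself: the % is split off into its own token, and decode() rejoins the tokens with single spaces. A minimal way to see this (assuming the same bert-base-cased checkpoint as above):

from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-cased')

# the % ends up as its own token (roughly ['text', 'with', 'percentage', '%']),
# which is why decode() re-inserts a space in front of it
print(tokenizer.tokenize('text with percentage%'))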

Asked By: Henryk Borzymowski


Answers:

According to https://github.com/huggingface/transformers/pull/1274, they're working on it; hopefully there will be a solution sometime next week.

Answered By: Anjie Guo

If you are trying to use BERT for token classification in order to find a span in your original string, then one workaround is to use BertTokenizerFast with the option return_offsets_mapping=True.

from transformers import BertTokenizerFast

test_string = 'text with percentage%'

tokenizer = BertTokenizerFast.from_pretrained('bert-base-cased')
tokens = tokenizer(test_string, return_offsets_mapping=True)
input_ids = tokens.data["input_ids"]

# some_model is a placeholder for whatever token classification model you run
span_start_index, span_stop_index = some_model(input_ids)

Then once you get the token classification results, you can do something like

span_start = tokens.encodings[0].offsets[span_start_index][0]
span_stop = tokens.encodings[0].offsets[span_stop_index][1]
predicted_span = test_string[span_start:span_stop]
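For reference, here is a small sketch (assuming the same bert-base-cased checkpoint, with its own encoding variable introduced just for illustration) that prints each token next to the exact substring its offsets map back to, which makes it easy to verify that spans recovered this way match the original string:

from transformers import BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained('bert-base-cased')
test_string = 'text with percentage%'

encoding = tokenizer(test_string, return_offsets_mapping=True)
for token, (start, stop) in zip(encoding.tokens(), encoding["offset_mapping"]):
    # special tokens such as [CLS] and [SEP] carry the empty offset (0, 0)
    print(token, (start, stop), repr(test_string[start:stop]))
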
Answered By: vermouth

Here is one method of combining percentage and %, though I am unsure whether it is useful for you.

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('bert-base-cased')

test_string = 'text with percentage%'
words = test_string.split()

collect_tokens = []
for word in words:
    tokens = tokenizer.tokenize(word)
    # mark every sub-token after the first as a continuation token ("##")
    # so that decode() joins it to the previous token without a space
    for index in range(1, len(tokens)):
        if "##" not in tokens[index]:
            tokens[index] = "##" + tokens[index]
    collect_tokens += tokens

# encode tokens to input_ids
input_ids = tokenizer.convert_tokens_to_ids(collect_tokens)

# decode
output = tokenizer.decode(input_ids)
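Note that this relies on the rewritten continuation token (here '##%') actually existing in the vocabulary: convert_tokens_to_ids maps unknown tokens to [UNK], in which case decode cannot merge them back onto the previous word. A quick sanity check, reusing the names from the snippet above:

# check whether the rewritten continuation token is actually in the vocab;
# if it is not, the input_ids above contain the [UNK] id instead
print('##%' in tokenizer.get_vocab())
print(collect_tokens)
print(output)  # the goal is to get 'text with percentage%' back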

Answered By: huang ting shieh