BertTokenizer – when encoding and decoding sequences extra spaces appear
Question:
When using Transformers from HuggingFace, I am facing a problem with the encode and decode methods.
I have the following string:
test_string = 'text with percentage%'
Then I am running the following code:
import torch
from transformers import BertTokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-cased')
test_string = 'text with percentage%'
# encode() converts a string into a sequence of ids (integers), using the tokenizer and vocabulary.
input_ids = tokenizer.encode(test_string)
output = tokenizer.decode(input_ids)
And the output looks like this:
'text with percentage %'
With an extra space before the %. I have tried extra arguments such as clean_up_tokenization_spaces, but that is for something different.
What should I use in encoding and decoding to get exactly the same text before and after? This also happens with other special characters.
Answers:
According to https://github.com/huggingface/transformers/pull/1274 they’re working on it. Hopefully there will be a solution sometime next week.
If you are trying to use BERT for token classification in order to find a span in your original string, then one workaround is to use BertTokenizerFast with the option return_offsets_mapping=True.
from transformers import BertTokenizerFast
test_string = 'text with percentage%'
tokenizer = BertTokenizerFast.from_pretrained('bert-base-cased')
tokens = tokenizer(test_string, return_offsets_mapping=True)
input_ids = tokens["input_ids"]
# some_model is a placeholder for your own token classification model
span_start_index, span_stop_index = some_model(input_ids)
Then, once you get the token classification results, you can do something like
# each offset is a (start, end) character position in the original string
offsets = tokens["offset_mapping"]
predicted_span = test_string[offsets[span_start_index][0]:offsets[span_stop_index][1]]
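To show the idea without downloading a model, here is a minimal sketch of how character offsets let you recover a span from the original string. The function toy_tokenize_with_offsets is a hypothetical stand-in written for this illustration, not the library's tokenizer, and the predicted indices are made up.

```python
import re

def toy_tokenize_with_offsets(text):
    """Hypothetical stand-in for a fast tokenizer: split into words and
    single punctuation characters, recording (start, end) offsets."""
    return [(m.group(), (m.start(), m.end()))
            for m in re.finditer(r"\w+|[^\w\s]", text)]

test_string = 'text with percentage%'
pairs = toy_tokenize_with_offsets(test_string)
tokens = [tok for tok, _ in pairs]
offsets = [span for _, span in pairs]

# Pretend a model predicted that the span covers tokens 2..3 (inclusive).
span_start_index, span_stop_index = 2, 3

# Slice the ORIGINAL string, so no spaces are added or lost.
predicted_span = test_string[offsets[span_start_index][0]:offsets[span_stop_index][1]]
print(tokens)          # ['text', 'with', 'percentage', '%']
print(predicted_span)  # percentage%
```

Because the span is cut out of the original string rather than rebuilt from tokens, the decoding step (and its extra space) is never involved.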
Here is one method of keeping percentage and % together, though I am unsure whether it is useful for you.
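from transformers import AutoTokenizer
# assumes the same checkpoint as in the question
tokenizer = AutoTokenizer.from_pretrained('bert-base-cased')
test_string = 'text with percentage%'
words = test_string.split()
collect_tokens = []
for word in words:
    tokens = tokenizer.tokenize(word)
    # mark every token after the first as a continuation piece of the same word
    for index in range(1, len(tokens)):
        if not tokens[index].startswith("##"):
            tokens[index] = "##" + tokens[index]
    collect_tokens += tokens
# encode tokens to input_ids (tokens missing from the vocabulary map to the unknown-token id)
input_ids = tokenizer.convert_tokens_to_ids(collect_tokens)
# decode
output = tokenizer.decode(input_ids)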
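As far as I can tell, the "##" trick works because decoding joins the tokens with spaces and then glues every "##" piece onto the token before it. A punctuation token without the prefix therefore keeps its space. The sketch below is a simplified pure-Python imitation of that step: toy_decode is a hypothetical helper, and the token lists are illustrative rather than actual tokenizer output.

```python
def toy_decode(tokens):
    """Simplified sketch of wordpiece detokenization: join with spaces,
    then glue '##' continuation pieces onto the previous token."""
    return " ".join(tokens).replace(" ##", "").strip()

plain = ["text", "with", "percentage", "%"]       # '%' as its own token
prefixed = ["text", "with", "percentage", "##%"]  # '%' marked as a continuation

print(toy_decode(plain))     # text with percentage %
print(toy_decode(prefixed))  # text with percentage%
```

This also suggests why clean_up_tokenization_spaces does not help here: it only tidies spacing around a fixed set of punctuation and contractions, and % does not appear to be among them.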