spaCy tokenization adds extra whitespace for dates with a hyphen separator when I manually build the Doc

Question:

I’ve been trying to solve a problem with the spaCy Tokenizer for a while, without any success. I’m also not sure whether it’s a problem with the tokenizer or some other part of the pipeline.

Description

I have an application that, for reasons beside the point, creates a spaCy Doc from the spaCy vocab and the list of tokens from a string (see code below). Note that while this is not the simplest and most common way to do this, according to the spaCy docs it can be done.

However, when I create a Doc for a text that contains compound words or dates with a hyphen as a separator, the behavior I get is not what I expected.

import spacy
from spacy.tokens import Doc

nlp = spacy.load("en_core_web_sm")  # assumption: any English pipeline with NER, e.g. en_core_web_sm

# My current way
doc = Doc(nlp.vocab, words=tokens)  # tokens is a well-defined list of tokens for a certain string

# Standard way
doc = nlp("My text...")

For example, with the following text, if I create the Doc using the standard procedure, the spaCy Tokenizer recognizes each "-" as a token, but the Doc’s text is the same as the input text; in addition, the spaCy NER model correctly recognizes the DATE entity.

import spacy

nlp = spacy.load("en_core_web_sm")  # assumption: same English pipeline as above

doc = nlp("What time will sunset be on 2022-12-24?")
print(doc.text)

tokens = [str(token) for token in doc]
print(tokens)

# Show entities
print(doc.ents[0].label_)
print(doc.ents[0].text)

Output:

What time will sunset be on 2022-12-24?
['What', 'time', 'will', 'sunset', 'be', 'on', '2022', '-', '12', '-', '24', '?']

DATE
2022-12-24

On the other hand, if I create the Doc from the model’s vocab and the previously computed tokens, the result is different. Note that for the sake of simplicity I am using the tokens from doc, so I’m sure there are no differences in tokenization. Also note that I am manually running each pipeline component over the doc in the correct order, so at the end of this process I should theoretically get the same results.

However, as you can see in the output below, while the Doc’s tokens are the same, the Doc’s text is different: there are blank spaces between the digits and the date separators.

doc2 = Doc(nlp.vocab, words=tokens)

# Run each model in pipeline
for model_name in nlp.pipe_names:
    pipe = nlp.get_pipe(model_name)
    doc2 = pipe(doc2)

# Print text and tokens
print(doc2.text)
tokens = [str(token) for token in doc2]
print(tokens)

# Show entities
print(doc2.ents[0].label_)
print(doc2.ents[0].text)

Output:

what time will sunset be on 2022 - 12 - 24 ? 
['what', 'time', 'will', 'sunset', 'be', 'on', '2022', '-', '12', '-', '24', '?']

DATE
2022 - 12 - 24

I know it must be something silly that I’m missing, but I can’t see it.

Could someone please explain to me what I’m doing wrong and point me in the right direction?

Thanks a lot in advance!

EDIT

Following Talha Tayyab’s suggestion, I have to create an array of booleans with the same length as my list of tokens, indicating for each token whether it is followed by a space. Then I pass this array to the Doc constructor as follows: doc = Doc(nlp.vocab, words=words, spaces=spaces).

To compute this list of boolean values based on my original text string and list of tokens, I implemented the following vanilla function:

from typing import List

def get_spaces(text: str, tokens: List[str]) -> List[bool]:

    # Spaces
    spaces = []
    # Copy text so it can be consumed as tokens are matched
    t = text.lower()

    # Iterate over tokens
    # (assumes every token actually occurs at the current position in text)
    for token in tokens:

        if t.startswith(token.lower()):

            t = t[len(token):]  # Remove token

            # If after removing the token the next character is a space
            if len(t) > 0 and t[0] == " ":
                spaces.append(True)
                t = t[1:]  # Remove space
            else:
                spaces.append(False)

    return spaces
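
For reference, applying the function to the example sentence above should reproduce the spaces list described earlier (expected output shown as a comment):

text = "What time will sunset be on 2022-12-24?"
tokens = ['What', 'time', 'will', 'sunset', 'be', 'on', '2022', '-', '12', '-', '24', '?']
print(get_spaces(text, tokens))
# [True, True, True, True, True, True, False, False, False, False, False, False]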

With these two improvements in my code, the result obtained is as expected. However, now I have the following question:

Is there a more spaCy-like way to compute the whitespace, instead of using my vanilla implementation?

Asked By: Emiliano Viotti


Answers:

Please try this:

from spacy.tokens import Doc

doc2 = Doc(nlp.vocab, words=tokens, spaces=[1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
# Run each model in pipeline
for model_name in nlp.pipe_names:
    pipe = nlp.get_pipe(model_name)
    doc2 = pipe(doc2)

# Print text and tokens
print(doc2.text)
tokens = [str(token) for token in doc2]
print(tokens)

# Show entities
print(doc2.ents[0].label_)
print(doc2.ents[0].text)

# You can also replace 0 with False and 1 with True

This is the complete syntax:

doc = Doc(nlp.vocab, words=words, spaces=spaces)

spaces is a list of boolean values indicating whether each word has a subsequent space. It must have the same length as words, if specified. It defaults to a sequence of True.

So you can choose which tokens are followed by a space and which are not.
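
As a minimal sketch of the effect (using a blank pipeline here purely for illustration; only the vocab matters for this demo, so the question’s nlp.vocab behaves the same):

import spacy
from spacy.tokens import Doc

nlp = spacy.blank("en")  # a blank pipeline; only the vocab is needed here

words = ["2022", "-", "12", "-", "24", "?"]
print(Doc(nlp.vocab, words=words, spaces=[False] * 6).text)  # 2022-12-24?
print(Doc(nlp.vocab, words=words).text)  # 2022 - 12 - 24 ?  (spaces default to True, including a trailing one)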

Reference: https://spacy.io/api/doc

Answered By: Talha Tayyab

Late to this, but since you’ve retrieved the tokens from a document to begin with, I think you can just use the whitespace_ attribute of the token for this. Then your get_spaces function looks like:

def get_spaces(tokens):
    # tokens must be spaCy Token objects here, not strings:
    # whitespace_ holds the whitespace that follows each token
    return [1 if token.whitespace_ else 0 for token in tokens]
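
For example (a usage sketch, assuming nlp is the same pipeline as in the question; iterating a Doc yields Token objects):

doc = nlp("What time will sunset be on 2022-12-24?")
print(get_spaces(doc))
# [1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0]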

Note that this won’t work nicely if there are multiple spaces or non-space whitespace (e.g. tabs); in that case you probably need to update the tokenizer, or use your existing solution and update this part:

            if len(t) > 0 and t[0] == " ":
                spaces.append(True)
                t = t[1:]  # Remove space

to check for generic whitespace and remove more than just a leading space.
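
A rough sketch of that change, under the assumption that any whitespace run should collapse to a single True (the spaces argument is only one boolean per token, so finer detail cannot be represented anyway):

import re

# ...inside get_spaces, after t = t[len(token):]:
match = re.match(r"\s+", t)  # any run of spaces, tabs or newlines at the start
if match:
    spaces.append(True)
    t = t[match.end():]  # consume the whole whitespace run
else:
    spaces.append(False)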

Answered By: radpotato