Using nlp.pipe() with pre-segmented and pre-tokenized text with spaCy

Question:

I am trying to tag and parse text that has already been split into sentences and has already been tokenized. As an example:

sents = [['I', 'like', 'cookies', '.'], ['Do', 'you', '?']]

The fastest approach to process batches of text is .pipe(). However, it is not clear to me how I can use that with pre-tokenized and pre-segmented text. Performance is key here. I tried the following, but it threw an error:

docs = [nlp.tokenizer.tokens_from_list(sentence) for sentence in sents]
nlp.tagger(docs)
nlp.parser(docs)

Trace:

Traceback (most recent call last):
  File "C:PythonPython37Libmultiprocessingpool.py", line 121, in worker
    result = (True, func(*args, **kwds))
  File "C:PythonprojectsPreDicTpredicting-wtebuild_id_dictionary.py", line 204, in process_batch
    self.nlp.tagger(docs)
  File "pipes.pyx", line 377, in spacy.pipeline.pipes.Tagger.__call__
  File "pipes.pyx", line 396, in spacy.pipeline.pipes.Tagger.predict
  File "C:Usersbmvroy.virtualenvspredicting-wte-YKqW76balibsite-packagesthincneural_classesmodel.py", line 169, in __call__
    return self.predict(x)
  File "C:Usersbmvroy.virtualenvspredicting-wte-YKqW76balibsite-packagesthincneural_classesfeed_forward.py", line 40, in predict
    X = layer(X)
  File "C:Usersbmvroy.virtualenvspredicting-wte-YKqW76balibsite-packagesthincneural_classesmodel.py", line 169, in __call__
    return self.predict(x)
  File "C:Usersbmvroy.virtualenvspredicting-wte-YKqW76balibsite-packagesthincneural_classesmodel.py", line 133, in predict
    y, _ = self.begin_update(X, drop=None)
  File "C:Usersbmvroy.virtualenvspredicting-wte-YKqW76balibsite-packagesthincneural_classesfeature_extracter.py", line 14, in begin_update
    features = [self._get_feats(doc) for doc in docs]
  File "C:Usersbmvroy.virtualenvspredicting-wte-YKqW76balibsite-packagesthincneural_classesfeature_extracter.py", line 14, in <listcomp>
    features = [self._get_feats(doc) for doc in docs]
  File "C:Usersbmvroy.virtualenvspredicting-wte-YKqW76balibsite-packagesthincneural_classesfeature_extracter.py", line 21, in _get_feats
    arr = doc.doc.to_array(self.attrs)[doc.start : doc.end]
AttributeError: 'list' object has no attribute 'doc'
Asked By: Bram Vanroy


Answers:

Just replace the default tokenizer in the pipeline with nlp.tokenizer.tokens_from_list instead of calling it separately:

import spacy
nlp = spacy.load('en')
nlp.tokenizer = nlp.tokenizer.tokens_from_list

for doc in nlp.pipe([['I', 'like', 'cookies', '.'], ['Do', 'you', '?']]):
    for token in doc:
        print(token, token.pos_)

Output:

I PRON
like VERB
cookies NOUN
. PUNCT
Do VERB
you PRON
? PUNCT
Answered By: aab

In spaCy v3, tokens_from_list no longer exists. Instead, you can replace the tokenizer with a custom callable that builds a Doc directly:

from spacy.tokens import Doc

class YourTokenizer:

    def __call__(self, your_doc_object):
        # get_words() and get_spaces() stand for however you extract the token
        # strings and trailing-whitespace flags from your own document object
        return Doc(
            nlp.vocab,
            words=get_words(your_doc_object),
            spaces=get_spaces(your_doc_object)
        )

nlp.tokenizer = YourTokenizer()

doc = nlp(your_doc_object)
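
For the pre-tokenized input from the question, the two helpers could be as simple as the following sketch (get_words and get_spaces are placeholder names from the snippet above, not spaCy API):

def get_words(token_lists):
    # flatten [['I', 'like', 'cookies', '.'], ['Do', 'you', '?']]
    # into ['I', 'like', 'cookies', '.', 'Do', 'you', '?']
    return [tok for sent in token_lists for tok in sent]

def get_spaces(token_lists):
    # assume a space after every token; adjust if you track real whitespace
    return [True] * len(get_words(token_lists))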
Answered By: chrishmorris

Use the Doc object

import spacy
from spacy.tokens import Doc
nlp = spacy.load("en_core_web_sm")

sents = [['I', 'like', 'cookies', '.'], ['Do', 'you', '?']]
for sent in sents:
    doc = Doc(nlp.vocab, words=sent)
    for token in nlp(doc):
        print(token.text, token.pos_)
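
Because the question is specifically about .pipe() performance, one possible extension (an untested sketch, assuming spaCy v3.1+ where nlp.pipe() also accepts Doc objects) is to build all the Docs first and then batch them through the rest of the pipeline:

docs = [Doc(nlp.vocab, words=sent) for sent in sents]
# Doc inputs skip tokenization; the remaining components run over the batch
for doc in nlp.pipe(docs, batch_size=64):
    for token in doc:
        print(token.text, token.pos_)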
Answered By: Victor Yan

This seems to be the best method and is still compatible as of spaCy v3.4. It is not as clean as @aab's solution, but their solution no longer works in spaCy v3. This approach creates a custom tokenizer that accepts a list of lists of strings (see the variable sents), and is based on a comment on this answer by @Bram Vanroy which points to this tokenizer (thanks Bram).

import spacy
from spacy import tokens
from typing import List, Any, Union
from spacy.util import DummyTokenizer


def flatten(raw: List[List[Any]]) -> List[Any]:
    """ Turns [['I', 'like', 'cookies', '.'], ['Do', 'you', '?']] to ['I', 'like', 'cookies', '.', 'Do', 'you', '?']"""
    return [tok for sent in raw for tok in sent]


class PreTokenizedPreSentencizedTokenizer(DummyTokenizer):
    """Custom tokenizer to be used in spaCy when the text is already pretokenized."""

    def __init__(self, vocab: spacy.vocab.Vocab):
        """Initialize tokenizer with a given vocab
        :param vocab: an existing vocabulary (see https://spacy.io/api/vocab)
        """
        self.vocab = vocab

    def __call__(self, inp: Union[List[str], str, List[List[str]]]) -> tokens.Doc:
        """Call the tokenizer on input `inp`.
        :param inp: either a string to be split on whitespace, or a list of tokens
        :return: the created Doc object
        """
        if isinstance(inp, str):
            words = inp.split()
            spaces = [True] * (len(words) - 1) + ([True] if inp[-1].isspace() else [False])
            return tokens.Doc(self.vocab, words=words, spaces=spaces)
        elif isinstance(inp, list):
            # Check if we have a flat list or a list of list
            if len(inp) == 0:
                return tokens.Doc(self.vocab, words=inp)
            if isinstance(inp[0], str):
                return tokens.Doc(self.vocab, words=inp)
            elif isinstance(inp[0], list):
                sent_starts = flatten([[1] + [0] * (len(sent) - 1) for sent in inp])
                return tokens.Doc(self.vocab, words=flatten(inp), sent_starts=sent_starts)
        # Anything else (including a list whose elements are neither str nor list) is unsupported
        raise ValueError("Unexpected input format. Expected a string, a list of tokens, or a list of lists of strings.")


# Normally load spacy NLP
nlp = spacy.load('en_core_web_sm', exclude=["parser", "senter"])
nlp.tokenizer = PreTokenizedPreSentencizedTokenizer(nlp.vocab)
sents = [['I', 'like', 'cookies', '.'], ['Do', 'you', '?']]

doc = nlp(sents)
print(len(list(doc.sents)))
print(doc)
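
As a small usage check (a sketch based on the code above), you can confirm that the sentence boundaries set via sent_starts survive and inspect the tagger output per sentence:

for sent in doc.sents:
    print([(token.text, token.pos_) for token in sent])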
Answered By: Priyansh Trivedi