ValueError: [E966] `nlp.add_pipe` when changing the sentence segmentation rule of a spaCy model

Question:

I am using Python 3.9.7 and the spaCy library and want to change the way the model segments a given sentence. Here is a sentence and the segmentation rule I created as an example:

import spacy
nlp=spacy.load('en_core_web_sm')

doc2=nlp(u'"Management is doing the  right things; leadership is doing the right things." -Peter Drucker')

def set_custom_boundaries(doc):
    for token in doc[:-1]:
        if token.text==";":
            doc[token.i +1].is_sent_start=True
    return doc

nlp.add_pipe(set_custom_boundaries, before='parser')

However, this produces the error message below:

ValueError                                Traceback (most recent call last)
C:\Users\SEYDOU~1\AppData\Local\Temp/ipykernel_21000/1705623728.py in <module>
----> 1 nlp.add_pipe(set_custom_boundaries, before='parser')

~\Anaconda3\lib\site-packages\spacy\language.py in add_pipe(self, factory_name, name, before, after, first, last, source, config, raw_config, validate)
    777             bad_val = repr(factory_name)
    778             err = Errors.E966.format(component=bad_val, name=name)
--> 779             raise ValueError(err)
    780         name = name if name is not None else factory_name
    781         if name in self.component_names:

ValueError: [E966] `nlp.add_pipe` now takes the string name of the registered component factory, not a callable component. Expected string, but got <function set_custom_boundaries at 0x000002520A59CCA0> (name: 'None').

- If you created your component with `nlp.create_pipe('name')`: remove nlp.create_pipe and call `nlp.add_pipe('name')` instead.
    
- If you passed in a component like `TextCategorizer()`: call `nlp.add_pipe` with the string name instead, e.g. `nlp.add_pipe('textcat')`.
    
- If you're using a custom component: Add the decorator `@Language.component` (for function components) or `@Language.factory` (for class components / factories) to your custom component and assign it a name, e.g. `@Language.component('your_name')`. You can then run `nlp.add_pipe('your_name')` to add it to the pipeline.

I looked at some solutions online; however, as a beginner in Python I could not solve the problem. How does one use a custom segmentation rule in the spaCy pipeline?

Asked By: Seydou GORO


Answers:

The syntax of `nlp.add_pipe` with a custom function is given here. In spaCy v3 you must (1) register the component function with the `@Language.component` decorator and (2) pass the component's name to `nlp.add_pipe` as a string. So it should be something like this:

from spacy.language import Language

@Language.component("set_custom_boundaries")
def set_custom_boundaries(doc):
    for token in doc[:-1]:
        if token.text == ";":
            doc[token.i + 1].is_sent_start = True
    return doc

nlp.add_pipe("set_custom_boundaries", before='parser')

Note: your function implements an unusual sentence segmentation rule, so it won't work in general. For example, it won't handle sentences that end with '.', '…', '!', etc.
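For completeness, here is a minimal end-to-end sketch of the pattern above. It uses a blank English pipeline so it runs without downloading `en_core_web_sm`; with the full model you would keep `before='parser'` as shown in the answer:

```python
import spacy
from spacy.language import Language

# Register the component under a string name (required in spaCy v3):
# nlp.add_pipe then receives this name, not the function itself.
@Language.component("set_custom_boundaries")
def set_custom_boundaries(doc):
    for token in doc[:-1]:
        if token.text == ";":
            doc[token.i + 1].is_sent_start = True
    return doc

# A blank pipeline has no parser, so the component is simply appended;
# with en_core_web_sm you would pass before="parser" instead.
nlp = spacy.blank("en")
nlp.add_pipe("set_custom_boundaries")

doc = nlp("Management is doing the right things; leadership is doing the right things.")
sentences = [sent.text for sent in doc.sents]
print(sentences)
```

Because `is_sent_start` is set on the token following the `;`, the document is split into two sentences at that point.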

Answered By: Erwan