spaCy: generalize a language factory that gets a regular expression to create spans in a text

Question:

Working with spaCy, it is possible to define spans in a document that correspond to regular-expression matches on the text.
I would like to generalize this into a language factory.

The code to create a span could be like this:

import re

import spacy
from spacy.tokens import Span

nlp = spacy.load("en_core_web_sm")
text = "this is pepa pig text comprising a brake and fig. 45. The house is white."
doc = nlp(text)

def _component(doc, name, regular_expression):
    # Create the span group if it does not exist yet.
    if name not in doc.spans:
        doc.spans[name] = []
    for i, match in enumerate(re.finditer(regular_expression, doc.text)):
        label = name + "_" + str(i)
        start, end = match.span()
        # Snap the character offsets to token boundaries.
        span = doc.char_span(start, end, alignment_mode="expand")
        span_to_add = Span(doc, span.start, span.end, label=label)
        doc.spans[name].append(span_to_add)
    return doc

doc = _component(doc, 'pepapig', r"pepa\spig")
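
For a quick sanity check, printing the span group should show the matched span (a minimal sketch, assuming the pattern is meant to match the literal phrase "pepa pig" in the sample text):

print(doc.spans["pepapig"])  # expected to print something like [pepa pig]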

I would like to generalize this into a factory.
The factory would take a list of regular expressions with names, like:

[{'name': 'pepapig', 'rex': r"pepa\spig"}, {'name': 'pepapig2', 'rex': r"george\spig"}]

The way I try to do this is as follows (the code does not work):

@Language.factory("myregexes6", default_config={})
def add_regex_match_as_span(nlp, name, regular_expressions):   
    for i,rex_d in enumerate(regular_expressions):
        print(rex_d)
        name = rex_d['name']
        rex = rex_d['rex']
        _component(doc, name=name, regular_expression=rex, DEBUG=False)

    return doc

nlp.add_pipe(add_regex_match_as_span(nlp, "MC", regular_expressions=[{'name':'pepapig','rex':r"pepa\spig"},{'name':'pepapig2','rex':r"george\spig"}]))

I am looking for a solution to the above code.

The error I get is:

[E966] `nlp.add_pipe` now takes the string name of the registered component factory, not a callable component. Expected string, but got this is pepa pig text comprising a brake and fig. 45. The house is white. (name: 'None').

- If you created your component with `nlp.create_pipe('name')`: remove nlp.create_pipe and call `nlp.add_pipe('name')` instead.

- If you passed in a component like `TextCategorizer()`: call `nlp.add_pipe` with the string name instead, e.g. `nlp.add_pipe('textcat')`.

- If you're using a custom component: Add the decorator `@Language.component` (for function components) or `@Language.factory` (for class components / factories) to your custom component and assign it a name, e.g. `@Language.component('your_name')`. You can then run `nlp.add_pipe('your_name')` to add it to the pipeline.

LAST EDIT

How can the factory be saved in a .py file and used from another file?

Asked By: JFerro


Answers:

I think that you need to follow what’s in [the documentation for custom components][1].
Here’s how I tried to solve the problem you’re facing.
I would start by creating the component, which in this case should be a class because it has parameters, i.e. a state. Here the state is a list of {'name': name, 'rex': rex} dictionaries called regex_list.

import re

from spacy.language import Language
from spacy.tokens import Span


class RegExComponent:
    def __init__(self, regex_list):
        self.regex_list = regex_list

    def __call__(self, doc):
        for re_item in self.regex_list:
            # Create the span group for this pattern if it does not exist yet.
            if re_item['name'] not in doc.spans:
                doc.spans[re_item['name']] = []
            for i, match in enumerate(re.finditer(re_item['rex'], doc.text)):
                label = re_item['name'] + "_" + str(i)
                start, end = match.span()
                # Snap the character offsets to token boundaries.
                span = doc.char_span(start, end, alignment_mode="expand")
                span_to_add = Span(doc, span.start, span.end, label=label)
                doc.spans[re_item['name']].append(span_to_add)
        return doc

Now that you have your component, you need a "factory" to create it with the specified parameters. Here’s how you can do it:

@Language.factory("myregex", default_config={})
def create_regex(nlp, name, regex_list):   
    return RegExComponent(regex_list)

nlp and name always have to be in the factory signature, while regex_list is the input of your regex component and is supplied via config when adding the pipe.
Here’s a sample of how you can call your newly created component:

regex_list = [{'name': 'pepapig', 'rex': r"pepa\spig"}, {'name': 'pepapig2', 'rex': r"george\spig"}]
nlp = spacy.load("en_core_web_sm")
nlp.add_pipe("myregex", "MC", config={'regex_list': regex_list})
text = "this is pepa pig text comprising a brake and fig. 45. The house is white. Hello george pig"
doc = nlp(text)
print(doc.spans)  # {'pepapig': [pepa pig], 'pepapig2': [george pig]}

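Regarding the last edit of the question: the factory can be saved in its own .py file and used from another file, because importing that module runs the @Language.factory decorator and registers the factory with spaCy. A minimal sketch, assuming the RegExComponent class and the create_regex factory shown above are saved in a hypothetical module called regex_component.py; the consuming script could then look like this:

# main.py -- any other file
import spacy

import regex_component  # the import alone registers the "myregex" factory

regex_list = [
    {'name': 'pepapig', 'rex': r"pepa\spig"},
    {'name': 'pepapig2', 'rex': r"george\spig"},
]

nlp = spacy.load("en_core_web_sm")
nlp.add_pipe("myregex", "MC", config={'regex_list': regex_list})

doc = nlp("this is pepa pig text. Hello george pig")
print(doc.spans)  # {'pepapig': [pepa pig], 'pepapig2': [george pig]}

The same import is also needed before spacy.load if you serialize a pipeline containing this component with nlp.to_disk and load it again later, otherwise spaCy will not know the "myregex" factory.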
I hope you found my answer helpful!
[1]: https://spacy.io/usage/processing-pipelines#example-stateful-components

Answered By: Hannibal