spaCy: generalize a language factory that gets a regular expression to create spans in a text
Question:
Working with spaCy it is possible to define spans in a document that correspond to a regular expression matching on the text.
I would like to generalize this into a language factory.
The code to create a span could be like this:
    import re

    import spacy
    from spacy.tokens import Span

    nlp = spacy.load("en_core_web_sm")
    text = "this is pepa pig text comprising a brake and fig. 45. The house is white."
    doc = nlp(text)

    def _component(doc, name, regular_expression):
        # Create the span group on first use
        if name not in doc.spans:
            doc.spans[name] = []
        for i, match in enumerate(re.finditer(regular_expression, doc.text)):
            label = name + "_" + str(i)
            start, end = match.span()
            # Expand the character offsets to whole-token boundaries
            span = doc.char_span(start, end, alignment_mode="expand")
            span_to_add = Span(doc, span.start, span.end, label=label)
            doc.spans[name].append(span_to_add)
        return doc

    doc = _component(doc, 'pepapig', r"pepa\spig")
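The character offsets handed to `doc.char_span` come directly from `re.finditer`, so it can help to check the pattern on the raw string first. The sketch below uses only the standard library; `match_offsets` is a hypothetical helper name, and it assumes the intended pattern is `pepa\spig` (with a whitespace class), since the literal string `pepaspig` does not occur in the text.

```python
import re

text = "this is pepa pig text comprising a brake and fig. 45. The house is white."


def match_offsets(pattern, text):
    """Return (start, end, matched_text) for every regex match in text."""
    return [(m.start(), m.end(), m.group()) for m in re.finditer(pattern, text)]


# \s matches the space between "pepa" and "pig"
with_whitespace = match_offsets(r"pepa\spig", text)
# Without the backslash the pattern is the literal string "pepaspig"
without_whitespace = match_offsets(r"pepaspig", text)

print(with_whitespace)     # [(8, 16, 'pepa pig')]
print(without_whitespace)  # []
```

An empty result here means the spaCy component would create an empty span group, whatever the pipeline setup looks like.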
I would like to generalize this into a factory.
The factory would take a particular list of regular expressions with names like:
    [{'name':'pepapig','rex':r"pepa\spig"},{'name':'pepapig2','rex':r"george\spig"}]
The way I try to do this is as follows (code does not work)
    @Language.factory("myregexes6", default_config={})
    def add_regex_match_as_span(nlp, name, regular_expressions):
        for i,rex_d in enumerate(regular_expressions):
            print(rex_d)
            name = rex_d['name']
            rex = rex_d['rex']
            _component(doc, name=name, regular_expression=rex, DEBUG=False)
        return doc

    nlp.add_pipe(add_regex_match_as_span(nlp, "MC", regular_expressions=[{'name':'pepapig','rex':r"pepa\spig"},{'name':'pepapig2','rex':r"george\spig"}]))
I am looking for a fix for the above code.
The error I get is:
[E966] `nlp.add_pipe` now takes the string name of the registered component factory, not a callable component. Expected string, but got this is pepa pig text comprising a brake and fig. 45. The house is white. (name: 'None').
- If you created your component with `nlp.create_pipe('name')`: remove nlp.create_pipe and call `nlp.add_pipe('name')` instead.
- If you passed in a component like `TextCategorizer()`: call `nlp.add_pipe` with the string name instead, e.g. `nlp.add_pipe('textcat')`.
- If you're using a custom component: Add the decorator `@Language.component` (for function components) or `@Language.factory` (for class components / factories) to your custom component and assign it a name, e.g. `@Language.component('your_name')`. You can then run `nlp.add_pipe('your_name')` to add it to the pipeline.
LAST EDIT
How can the factory be saved in a .py file and imported from another file?
Answers:
I think that you need to follow what’s in [the documentation for custom components][1].
Here’s how I tried to solve the problem you’re facing.
I would start by creating the component, which should be a class in this case because it has to hold state: its parameters. Here, the parameter is a list of {'name': name, 'rex': rex} dictionaries called regex_list.

    import re

    from spacy.tokens import Span

    class RegExComponent:
        def __init__(self, regex_list):
            # State of the component: a list of {'name': ..., 'rex': ...} dicts
            self.regex_list = regex_list

        def __call__(self, doc):
            for re_item in self.regex_list:
                if re_item['name'] not in doc.spans:
                    doc.spans[re_item['name']] = []
                for i, match in enumerate(re.finditer(re_item['rex'], doc.text)):
                    label = re_item['name'] + "_" + str(i)
                    start, end = match.span()
                    # Expand the character offsets to whole-token boundaries
                    span = doc.char_span(start, end, alignment_mode="expand")
                    span_to_add = Span(doc, span.start, span.end, label=label)
                    doc.spans[re_item['name']].append(span_to_add)
            return doc

Now that you have your component, you need a "factory" to create it with the specified parameters. Here’s how you can do it:

    from spacy.language import Language

    @Language.factory("myregex", default_config={})
    def create_regex(nlp, name, regex_list):
        return RegExComponent(regex_list)

nlp and name must always be present in the factory signature, while regex_list is the input of your regex component.
Here’s a sample of how you can call your newly created component:

    import spacy

    regex_list = [{'name':'pepapig','rex':r"pepa\spig"},{'name':'pepapig2','rex':r"george\spig"}]
    nlp = spacy.load("en_core_web_sm")
    # Pass the registered factory name as a string; "MC" is the name of this pipe instance
    nlp.add_pipe("myregex", "MC", config={'regex_list': regex_list})
    text = "this is pepa pig text comprising a brake and fig. 45. The house is white. Hello george pig"
    doc = nlp(text)
    print(doc.spans) # {'pepapig': [pepa pig], 'pepapig2': [george pig]}

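For a version that can be pasted into a single file and run without downloading a model, here is a minimal sketch of the same idea. It is not the only way to do this: the names `RegexSpanComponent`, `create_regex_spans` and `regex_spans_sketch` are made up for illustration, `spacy.blank("en")` stands in for `en_core_web_sm`, the patterns are assumed to be `pepa\spig` and `george\spig`, and the empty-list default in `default_config` is an assumption about what a sensible default would be.

```python
import re

import spacy
from spacy.language import Language


class RegexSpanComponent:
    """Hypothetical stateful component: stores compiled patterns per span key."""

    def __init__(self, regex_list):
        # Compile once at construction time instead of on every document
        self.patterns = [(item["name"], re.compile(item["rex"])) for item in regex_list]

    def __call__(self, doc):
        for key, pattern in self.patterns:
            spans = []
            for i, match in enumerate(pattern.finditer(doc.text)):
                start, end = match.span()
                # char_span can attach the label directly; "expand" snaps to token boundaries
                span = doc.char_span(start, end, label=f"{key}_{i}", alignment_mode="expand")
                if span is not None:
                    spans.append(span)
            # Note: this replaces any existing span group stored under the same key
            doc.spans[key] = spans
        return doc


# Assumption: an empty list is used as the default so the pipe can be added without config
@Language.factory("regex_spans_sketch", default_config={"regex_list": []})
def create_regex_spans(nlp, name, regex_list):
    return RegexSpanComponent(regex_list)


regex_list = [
    {"name": "pepapig", "rex": r"pepa\spig"},
    {"name": "pepapig2", "rex": r"george\spig"},
]
nlp = spacy.blank("en")  # tokenizer only, no trained model needed
nlp.add_pipe("regex_spans_sketch", config={"regex_list": regex_list})

doc = nlp("this is pepa pig text comprising a brake and fig. 45. The house is white. Hello george pig")
found = {key: [span.text for span in group] for key, group in doc.spans.items()}
print(found)  # {'pepapig': ['pepa pig'], 'pepapig2': ['george pig']}
```

Regarding the last edit in the question: the `@Language.factory` decorator registers the factory when the module containing it is imported. So if the class and the factory function live in their own file (say, a hypothetical `regex_component.py`), importing that module in the other script before calling `nlp.add_pipe("regex_spans_sketch", ...)` should be enough to make the factory name known to spaCy.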
I hope you find this answer helpful!
[1]: https://spacy.io/usage/processing-pipelines#example-stateful-components