Make spaCy tokenizer not split on '/'
Question:
How do I modify the English tokenizer to prevent splitting tokens on the '/'
character?
For example, the following string should be one token:
import spacy
nlp = spacy.load('en_core_web_md')
doc = nlp("12/AB/568793")
for t in doc:
    print(f"[{t.pos_} {t.text}]")
# produces
#[NUM 12]
#[SYM /]
#[ADJ AB/568793]
Answers:
The approach is a variation on removing a rule, as described in the "Modifying existing rule sets" section of the spaCy tokenizer documentation:
nlp = spacy.load('en_core_web_md')
infixes = nlp.Defaults.infixes
assert(len([x for x in infixes if '/' in x])==1) # there seems to just be one rule that splits on /'s
# remove that rule; then modify the tokenizer
infixes = [x for x in infixes if '/' not in x]
nlp.tokenizer.infix_finditer = spacy.util.compile_infix_regex(infixes).finditer
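Conceptually, the infix rules are just regular expressions joined into one pattern, and the tokenizer splits wherever that pattern matches between characters. A minimal sketch of why filtering out the slash rule works, using illustrative patterns (these are hypothetical stand-ins, not spaCy's actual defaults):

```python
import re

# Illustrative infix patterns (hypothetical, not spaCy's real defaults)
infixes = [r"(?<=[0-9])[+\-\*^](?=[0-9-])", r"[\(\)\[\]/]"]

def split_points(text, patterns):
    """Return the spans where an infix match would split the text."""
    finditer = re.compile("|".join(patterns)).finditer
    return [m.span() for m in finditer(text)]

print(split_points("12/AB/568793", infixes))    # the slashes match -> token splits
infixes = [p for p in infixes if "/" not in p]  # drop the slash-containing rule
print(split_points("12/AB/568793", infixes))    # no matches -> stays one token
```

With the slash rule removed, the compiled regex finds no match inside "12/AB/568793", so the tokenizer has no reason to split it.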
The answer by @Dave is a good starting point, but I think the correct approach is to modify the rule rather than delete it entirely.
nlp = spacy.load('en_core_web_md')
infixes = nlp.Defaults.infixes
rule_slash = [x for x in infixes if '/' in x][0]
print(rule_slash) # check the rule
You will see that the rule also covers other characters, including '=', '<', '>', etc.
We remove only '/' from the rule:
rule_slash_new = rule_slash.replace('/', '')
# replace the old rule with the new rule
infixes = [r if r!=rule_slash else rule_slash_new for r in infixes]
nlp.tokenizer.infix_finditer = spacy.util.compile_infix_regex(infixes).finditer
This way the tokenizer will still split correctly in cases like "A=B" or "A>B".
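The same replace-rather-than-delete idea can be sketched on an illustrative pattern string (a hypothetical stand-in, not spaCy's actual rule), showing that the other symbols keep splitting while '/' no longer does:

```python
import re

# Hypothetical stand-in for a default rule that splits on several symbols
rule = r"[<>=/]"

# Drop only the slash from the character class, keeping the rest of the rule
rule_new = rule.replace("/", "")

def splits(text, pattern):
    """Return the infix characters the pattern would split on."""
    return [m.group() for m in re.finditer(pattern, text)]

print(splits("12/AB/568793", rule_new))  # '/' no longer splits
print(splits("A=B", rule_new))           # '=' still splits
```

Deleting the whole rule would also have stopped the tokenizer from splitting on '=', '<', and '>', which is why editing the pattern in place is the safer fix.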