Extracting start and end indices of a token using spacy
Question:
I am working with a lot of sentences and want to extract the start and end character indices of a word in a given sentence.
For example, the input is as follows:
"This is a sentence written in English by a native English speaker."
What I want is the span of the word ‘English’, which in this case is (30, 37) and (50, 57).
Note: I was pointed to this answer (Get position of word in sentence with spacy),
but it doesn’t solve my problem. It helps me get the start character of the token, but not the end index.
All help appreciated.
Answers:
You can do this with re in pure Python:
import re
s = "This is a sentence written in english by a native English speaker."
# case-insensitive search: compare the uppercased word against the uppercased text
[(m.start(), m.end()) for m in re.finditer('ENGLISH', s.upper())]
# output
[(30, 37), (50, 57)]
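A variation on the regex approach, in case substring matches are a concern: the pattern above also matches inside longer words such as "Englishman", and uppercasing the text can change its length for a few characters (for example "ß".upper() gives "SS"), which would shift the offsets. A minimal sketch using word boundaries and re.IGNORECASE instead; the function name find_word_spans is made up for illustration:

```python
import re

def find_word_spans(text, word):
    """Return (start, end) character offsets of whole-word, case-insensitive matches."""
    # re.escape guards against regex metacharacters in the word;
    # \b restricts matches to whole words.
    pattern = re.compile(r"\b" + re.escape(word) + r"\b", re.IGNORECASE)
    return [(m.start(), m.end()) for m in pattern.finditer(text)]

s = "This is a sentence written in english by a native English speaker."
print(find_word_spans(s, "English"))  # [(30, 37), (50, 57)]
```

The end offset is exclusive, so s[start:end] gives back the matched word.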
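You can do it in spaCy as well:
import spacy
nlp = spacy.load("en_core_web_sm")
doc = nlp("This is a sentence written in english by a native English speaker.")
# doc.ents contains only the spans the NER model tagged as entities,
# so a word is found only if the model recognized it as an entity
for ent in doc.ents:
    if ent.text.upper() == 'ENGLISH':
        print(ent.start_char, ent.end_char)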
Using the idea from the answer you linked, you could do something like this:
from spacy.lang.en import English
nlp = English()
s = nlp("This is a sentence written in english by a native English speaker")
boundaries = []
# s[:-1] skips the last token so that s[idx+1] always exists
for idx, i in enumerate(s[:-1]):
    if i.text.lower() == "english":
        # end = start of the next token minus one; assumes a single space separates the two tokens
        boundaries.append((i.idx, s[idx+1].idx-1))
You can simply do it like this using spaCy, which does not need any check for the last token (unlike @giovanni’s solution):
from spacy.lang.en import English
nlp = English()  # tokenizer-only pipeline, enough for character offsets

def get_char_span(input_txt):
    doc = nlp(input_txt)
    for i, token in enumerate(doc):
        start_i = token.idx
        end_i = start_i + len(token.text)
        # token index and the token itself
        print(i, token)
        # character span (end is exclusive)
        print((start_i, end_i))
        # verifying it against the original input text
        print(input_txt[start_i:end_i])

inp = "My name is X, what's your name?"
get_char_span(inp)
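Building on the same token.idx + len(token.text) idea, here is a rough sketch that returns the spans for one target word instead of printing every token. It assumes a blank English pipeline (spacy.blank("en"), tokenizer only) is sufficient, and the function name word_char_spans is made up for illustration:

```python
import spacy

# Assumption: only token offsets are needed, so a blank pipeline
# (tokenizer only, no trained model) is used here.
nlp = spacy.blank("en")

def word_char_spans(text, word):
    """Return (start, end) character offsets of tokens equal to `word`, ignoring case."""
    target = word.lower()
    doc = nlp(text)
    # token.idx is the start offset; adding the token length gives the exclusive end
    return [(t.idx, t.idx + len(t.text)) for t in doc if t.lower_ == target]

sentence = "This is a sentence written in English by a native English speaker."
print(word_char_spans(sentence, "English"))  # [(30, 37), (50, 57)]
```

Because the end is computed from the token's own length, it works for the last token and for tokens followed directly by punctuation.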