Extracting start and end indices of a token using spacy

Question

I am processing a large number of sentences and want to extract the start and end character indices of a word in a given sentence.

For example, the input is as follows:

"This is a sentence written in English by a native English speaker."

What I want is the span of the word ‘English’, which in this case is (30, 37) and (50, 57).

Note: I was pointed to this answer (Get position of word in sentence with spacy), but it doesn’t solve my problem. It gives me the start character of the token but not the end index.

All help appreciated

Asked By: ary

Answer 1

You can do this with re in pure Python:

import re

s = "This is a sentence written in english by a native English speaker."

# Upper-case the text so the search is case-insensitive
[(i.start(), i.end()) for i in re.finditer('ENGLISH', s.upper())]

# Output
[(30, 37), (50, 57)]
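
Two caveats with the regex approach: str.upper() can change the length of a string for some characters (for example, "ß" becomes "SS"), which would shift the reported offsets, and a plain substring search also matches inside longer words such as "Englishman". Below is a minimal sketch that sidesteps both, assuming whole-word matches are what is wanted. The helper name find_word_spans is hypothetical and not part of the original answer.

```python
import re

def find_word_spans(text, word):
    """Return (start, end) character offsets of every whole-word match."""
    # re.escape guards against regex metacharacters in the search word,
    # \b restricts matches to whole words, and IGNORECASE replaces upper().
    pattern = re.compile(r"\b" + re.escape(word) + r"\b", re.IGNORECASE)
    return [m.span() for m in pattern.finditer(text)]

sentence = "This is a sentence written in English by a native English speaker."
print(find_word_spans(sentence, "english"))
```

Because the search runs on the original string, the offsets can be used to slice that string directly.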

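You can do it in spaCy as well. Note that doc.ents contains only the named entities the model recognizes, so this finds the word only when it is tagged as an entity:

import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("This is a sentence written in english by a native English speaker.")
for ent in doc.ents:
    if ent.text.upper() == 'ENGLISH':
        # start_char and end_char are character offsets into the original text
        print(ent.start_char, ent.end_char)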

Answered By: God Is One

Answer 2

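Using the idea from the answer you linked, you could do something like this:

from spacy.lang.en import English

nlp = English()
s = nlp("This is a sentence written in english by a native English speaker")
boundaries = []
# Skip the last token, since it has no following token to measure against
for idx, i in enumerate(s[:-1]):
    if i.text.lower() == "english":
        # End index = start of the next token minus one,
        # which assumes exactly one space separates the two tokens
        boundaries.append((i.idx, s[idx+1].idx-1))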

Answered By: Giovanni

Answer 3

You can simply do it like this using spaCy, which does not need any check for the last token (unlike @giovanni’s solution):

import spacy

# Any spaCy pipeline works here, since only the tokenizer is needed
nlp = spacy.load("en_core_web_sm")

def get_char_span(input_txt):
    doc = nlp(input_txt)

    for i, token in enumerate(doc):
        start_i = token.idx
        end_i = start_i + len(token.text)

        # token index and the token
        print(i, token)
        # character span
        print((start_i, end_i))
        # verifying it against the original input text
        print(input_txt[start_i:end_i])

inp = "My name is X, what's your name?"
get_char_span(inp)
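
To get back to the original question, the same token.idx plus token length idea can be filtered down to one target word and collected into a list instead of printed. This is a minimal sketch, assuming a blank English pipeline (tokenizer only, no trained model) is enough for the task. The function name token_spans is hypothetical.

```python
import spacy

# A blank English pipeline provides only the tokenizer, so no model download is needed.
nlp = spacy.blank("en")

def token_spans(text, word):
    """Return (start, end) character offsets of tokens equal to `word`, ignoring case."""
    doc = nlp(text)
    return [
        # token.idx is the token's start offset; len(token) is its length in characters
        (token.idx, token.idx + len(token))
        for token in doc
        if token.lower_ == word.lower()
    ]

sentence = "This is a sentence written in English by a native English speaker."
print(token_spans(sentence, "english"))
```

Since the end offset comes from the token itself rather than from its neighbor, this works for the last token and for tokens followed directly by punctuation.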

Answered By: A'r SHAON
