ValueError: [E024] Could not find an optimal move to supervise the parser
Question:
I am getting the following error while training a spaCy
NER model with my custom training data.
ValueError: [E024] Could not find an optimal move to supervise the parser. Usually, this means the GoldParse was not correct. For example, are all labels added to the model?
Can anyone help me with this?
Answers:
Passing the training data through the function below cleans the spans, and training then runs without errors.
import re

def trim_entity_spans(data: list) -> list:
    """Removes leading and trailing whitespace from entity spans.

    Args:
        data (list): The data to be cleaned in spaCy JSON format.

    Returns:
        list: The cleaned data.
    """
    invalid_span_tokens = re.compile(r'\s')
    cleaned_data = []
    for text, annotations in data:
        entities = annotations['entities']
        valid_entities = []
        for start, end, label in entities:
            valid_start = start
            valid_end = end
            # Advance the start index past any leading whitespace.
            while valid_start < len(text) and invalid_span_tokens.match(
                    text[valid_start]):
                valid_start += 1
            # Pull the end index back past any trailing whitespace.
            while valid_end > 1 and invalid_span_tokens.match(
                    text[valid_end - 1]):
                valid_end -= 1
            valid_entities.append([valid_start, valid_end, label])
        cleaned_data.append([text, {'entities': valid_entities}])
    return cleaned_data
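As a quick sanity check, here is a self-contained sketch that runs the cleaner on a made-up row (the text and the PERSON label are invented for illustration; the function body is reproduced from above so the snippet runs on its own):

```python
import re

def trim_entity_spans(data):
    """Same cleaner as above, reproduced so this demo is self-contained."""
    invalid_span_tokens = re.compile(r'\s')
    cleaned_data = []
    for text, annotations in data:
        valid_entities = []
        for start, end, label in annotations['entities']:
            # Advance past leading whitespace, pull back past trailing whitespace.
            while start < len(text) and invalid_span_tokens.match(text[start]):
                start += 1
            while end > 1 and invalid_span_tokens.match(text[end - 1]):
                end -= 1
            valid_entities.append([start, end, label])
        cleaned_data.append([text, {'entities': valid_entities}])
    return cleaned_data

# Hypothetical row: the PERSON span (0, 5) covers "John " with a trailing
# space, which is exactly the kind of span that triggers E024.
sample = [("John works at Acme", {'entities': [(0, 5, 'PERSON')]})]
print(trim_entity_spans(sample))
# → [['John works at Acme', {'entities': [[0, 4, 'PERSON']]}]]
```

The span is tightened from (0, 5) to (0, 4), so it covers exactly "John" with no trailing space.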
This happens when there is empty or inconsistent content in your annotations, for example a missing label, or start/end offsets of a span that do not line up with the text. The cleaning function above should handle trimming the data. If you want a brute-force approach instead, wrap the model update in an exception handler so that bad rows are skipped, as follows:
import random
import spacy

def train_spacy(data, iterations):
    # Note: this uses the spaCy 2.x training API.
    nlp = spacy.blank('en')  # create a blank Language class
    # Create the built-in pipeline components and add them to the pipeline.
    # nlp.create_pipe works for built-ins that are registered with spaCy.
    if 'ner' not in nlp.pipe_names:
        ner = nlp.create_pipe('ner')
        nlp.add_pipe(ner, last=True)
    # Add every label found in the training data to the model.
    for _, annotations in data:
        for ent in annotations.get('entities'):
            ner.add_label(ent[2])
    # Get the names of the other pipes so they can be disabled during training.
    other_pipes = [pipe for pipe in nlp.pipe_names if pipe != 'ner']
    with nlp.disable_pipes(*other_pipes):  # only train NER
        optimizer = nlp.begin_training()
        for itn in range(iterations):
            print("Starting iteration " + str(itn))
            random.shuffle(data)
            losses = {}
            for text, annotations in data:
                try:
                    nlp.update(
                        [text],
                        [annotations],
                        drop=0.2,
                        sgd=optimizer,
                        losses=losses)
                except Exception as error:
                    # Log rows spaCy cannot align (e.g. E024) and keep going.
                    print(error)
                    continue
            print(losses)
    return nlp
So if your TRAIN_DATA contains 1,000 rows and only row 200 has empty data, instead of the whole run failing with the error, the model will skip row 200 on every iteration and train on the remaining rows.
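The skip-on-error pattern itself can be sketched without spaCy at all (train_rows and the row dicts below are made-up stand-ins for the real training loop, not part of the original code):

```python
def train_rows(rows):
    """Sketch of the skip-on-error loop: a row that raises is logged
    and skipped instead of aborting the whole pass."""
    trained, skipped = 0, 0
    for row in rows:
        try:
            _ = row['text'].lower()  # stand-in for nlp.update(...)
            trained += 1
        except Exception as error:
            print(error)             # log the bad row, keep going
            skipped += 1
    return trained, skipped

print(train_rows([{'text': 'good row'}, {}, {'text': 'another good row'}]))
# → (2, 1): the empty row is skipped, the rest still train
```

The trade-off is the same as in the spaCy loop: silently skipping rows hides data problems, so printing the error (or counting skips, as here) keeps them visible.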
For the data format supported by spaCy v3 training, pass the list of training data through this function:
import re

def clean_entity_spans(data: list) -> list:
    """Trims whitespace from entity spans in the annotation-tool format."""
    invalid_span_tokens = re.compile(r'\s')
    cleaned_data = []
    for content in data:
        name = content['documentName']
        text = content['document']
        userinput = content['user_input']
        valid_entities = []
        for annotate_content in content['annotation']:
            start = annotate_content['start']
            end = annotate_content['end']
            label = annotate_content['label']
            text1 = annotate_content['text']
            valid_start = start
            valid_end = end
            # Advance the start index past any leading whitespace.
            while valid_start < len(text) and invalid_span_tokens.match(
                    text[valid_start]):
                valid_start += 1
            # Pull the end index back past any trailing whitespace.
            while valid_end > 1 and invalid_span_tokens.match(
                    text[valid_end - 1]):
                valid_end -= 1
            valid_entities.append({'start': valid_start, 'end': valid_end,
                                   'label': label, 'text': text1,
                                   'propertiesList': [], 'commentsList': []})
        cleaned_data.append({'documentName': name, 'document': text,
                             'annotation': valid_entities,
                             'user_input': userinput})
    return cleaned_data
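As with the v2 cleaner, a quick self-contained check (the sample document and label are invented; the function is reproduced from above, condensed, so the snippet runs on its own):

```python
import re

def clean_entity_spans(data):
    """Same cleaner as above, reproduced so this demo runs standalone."""
    invalid_span_tokens = re.compile(r'\s')
    cleaned_data = []
    for content in data:
        text = content['document']
        valid_entities = []
        for ann in content['annotation']:
            start, end = ann['start'], ann['end']
            while start < len(text) and invalid_span_tokens.match(text[start]):
                start += 1
            while end > 1 and invalid_span_tokens.match(text[end - 1]):
                end -= 1
            valid_entities.append({'start': start, 'end': end,
                                   'label': ann['label'], 'text': ann['text'],
                                   'propertiesList': [], 'commentsList': []})
        cleaned_data.append({'documentName': content['documentName'],
                             'document': text,
                             'annotation': valid_entities,
                             'user_input': content['user_input']})
    return cleaned_data

# Invented sample in the annotation-tool format used above; the span
# (0, 9) covers "Jane Doe " including the trailing space.
sample = [{
    'documentName': 'doc1',
    'document': 'Jane Doe joined Acme',
    'user_input': '',
    'annotation': [{'start': 0, 'end': 9, 'label': 'PERSON',
                    'text': 'Jane Doe '}],
}]
cleaned = clean_entity_spans(sample)
print(cleaned[0]['annotation'][0]['end'])
# → 8, i.e. the span now ends right after "Doe"
```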