ValueError: [E024] Could not find an optimal move to supervise the parser

Question:

I am getting the following error while training spacy NER model with my custom training data.

ValueError: [E024] Could not find an optimal move to supervise the parser. Usually, this means the GoldParse was not correct. For example, are all labels added to the model?

Can anyone help me with this?

Asked By: Siddharth Das


Answers:

Passing the training data through the function below removes the offending whitespace, and training then runs without the error:

import re

def trim_entity_spans(data: list) -> list:
    """Removes leading and trailing white spaces from entity spans.

    Args:
        data (list): The data to be cleaned in spaCy JSON format.

    Returns:
        list: The cleaned data.
    """
    invalid_span_tokens = re.compile(r'\s')

    cleaned_data = []
    for text, annotations in data:
        entities = annotations['entities']
        valid_entities = []
        for start, end, label in entities:
            valid_start = start
            valid_end = end
            while valid_start < len(text) and invalid_span_tokens.match(
                    text[valid_start]):
                valid_start += 1
            while valid_end > 1 and invalid_span_tokens.match(
                    text[valid_end - 1]):
                valid_end -= 1
            valid_entities.append([valid_start, valid_end, label])
        cleaned_data.append([text, {'entities': valid_entities}])

    return cleaned_data
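
A quick sanity check of the function (the sample text and offsets are made up for illustration):

TRAIN_DATA = [
    ("John Doe is a developer. ", {'entities': [[0, 9, 'PERSON']]}),
]
# the span [0, 9] covers 'John Doe ' with a trailing space;
# trimming shrinks it to [0, 8], which aligns with token boundaries
print(trim_entity_spans(TRAIN_DATA))
# [['John Doe is a developer. ', {'entities': [[0, 8, 'PERSON']]}]]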
Answered By: Siddharth Das

This happens when there is empty or invalid content in your annotations, for example a missing tag, label, or start/end offset for an entity. The solution provided above works for trimming/cleaning the data. However, if you want a brute-force approach, just wrap the model update in an exception handler, as follows:
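
For illustration, here is a made-up row of the kind that triggers E024: the span's start offset points at a whitespace character, so spaCy cannot align it to token boundaries.

# hypothetical training row: text[5:16] is ' John Smith', with a
# leading space, so the span does not align with token boundaries
# and nlp.update() raises E024
bad_row = (
    "Call  John Smith today.",
    {'entities': [[5, 16, 'PERSON']]},
)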

import random
import spacy

def train_spacy(data, iterations):
    # note: this uses the spaCy v2 API (nlp.update with text/annotation lists)
    nlp = spacy.blank('en')  # create a blank Language class
    # create the built-in pipeline components and add them to the pipeline;
    # nlp.create_pipe works for built-ins that are registered with spaCy
    if 'ner' not in nlp.pipe_names:
        ner = nlp.create_pipe('ner')
        nlp.add_pipe(ner, last=True)
    else:
        ner = nlp.get_pipe('ner')

    # add labels
    for _, annotations in data:
        for ent in annotations.get('entities'):
            ner.add_label(ent[2])

    # get names of other pipes to disable them during training
    other_pipes = [pipe for pipe in nlp.pipe_names if pipe != 'ner']
    with nlp.disable_pipes(*other_pipes):  # only train NER
        optimizer = nlp.begin_training()
        for itn in range(iterations):
            print("Starting iteration " + str(itn))
            random.shuffle(data)
            losses = {}
            for text, annotations in data:
                try:
                    nlp.update(
                        [text],
                        [annotations],
                        drop=0.2,
                        sgd=optimizer,
                        losses=losses)
                except Exception as error:
                    # skip rows that still raise (e.g. E024) and keep training
                    print(error)
                    continue
            print(losses)
    return nlp

So assuming your training data contains 1000 rows and only row number 200 has empty data, instead of the model throwing the error, it will simply skip row 200 and train on the remaining data.
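
A minimal usage sketch (the iteration count and output directory are just placeholders):

# assumes TRAIN_DATA is a list of (text, {'entities': [...]}) tuples,
# ideally already cleaned with trim_entity_spans() from the answer above
trained_nlp = train_spacy(TRAIN_DATA, iterations=20)
trained_nlp.to_disk('custom_ner_model')  # placeholder output directory

# quick sanity check on unseen text
doc = trained_nlp('Jane works at Acme.')
print([(ent.text, ent.label_) for ent in doc.ents])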

Answered By: Olasimbo Arigbabu

For the spaCy v3 training data format, pass your list of training data through this function:

import re

def clean_entity_spans(data: list) -> list:
    """Removes leading and trailing whitespace from entity spans."""
    invalid_span_tokens = re.compile(r'\s')

    cleaned_data = []
    for content in data:
        name = content['documentName']
        text = content['document']
        userinput = content['user_input']

        valid_entities = []
        for annotate_content in content['annotation']:
            start = annotate_content['start']
            end = annotate_content['end']
            label = annotate_content['label']
            entity_text = annotate_content['text']

            valid_start = start
            valid_end = end
            while valid_start < len(text) and invalid_span_tokens.match(
                    text[valid_start]):
                valid_start += 1
            while valid_end > 1 and invalid_span_tokens.match(
                    text[valid_end - 1]):
                valid_end -= 1

            valid_entities.append({
                'start': valid_start,
                'end': valid_end,
                'label': label,
                'text': entity_text,
                'propertiesList': [],
                'commentsList': []
            })
        cleaned_data.append({
            'documentName': name,
            'document': text,
            'annotation': valid_entities,
            'user_input': userinput
        })

    return cleaned_data
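
To actually train with spaCy v3, the cleaned annotations still need to be converted into the binary .spacy format used by the spacy train CLI. Here is a minimal sketch (to_docbin and the output path are hypothetical; spans that still fail to align come back from char_span as None and are skipped):

import spacy
from spacy.tokens import DocBin

def to_docbin(cleaned_data: list, output_path: str) -> None:
    nlp = spacy.blank('en')
    db = DocBin()
    for content in cleaned_data:
        doc = nlp.make_doc(content['document'])
        ents = []
        for ann in content['annotation']:
            # char_span returns None if the offsets do not align with tokens
            span = doc.char_span(ann['start'], ann['end'], label=ann['label'],
                                 alignment_mode='contract')
            if span is not None:
                ents.append(span)
        doc.ents = ents
        db.add(doc)
    db.to_disk(output_path)

to_docbin(clean_entity_spans(data), './train.spacy')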
Answered By: Kamal Godar