How to keep structure of text after feeding it to a pipeline for NER

Question:

I’ve built an NER (named entity recognition) model by fine-tuning an existing HuggingFace model to recognize my custom entities. The text I want to run the model on is in a txt file.

Here is how I use the model:

from transformers import pipeline

# loading the fine-tuned model
ner_pipeline =  pipeline('token-classification', model="./my-model.model/", tokenizer="./my-model.model/", ignore_labels=[])

with open(my_file, 'r', encoding="utf8") as f:
  lines = f.readlines()
  joined_lines = ' '.join(lines)

  result = ner_pipeline(joined_lines, aggregation_strategy='first')
  text = ""
      
  for group in result:
     if group["entity_group"] != 'O':
        # substitute the entity with its tag
        text += group["entity_group"]+ " "
     else:
        text += group["word"] + " "

Basically, I substitute each recognized entity with its entity tag and leave the rest of the text as is.

With my code, the final text has exactly the content I want, but the structure is lost. By doing ' '.join(lines) I’m basically throwing away the \n characters in the text, which I would like to keep in my reconstructed text.

I’ve tried feeding the pipeline single sentences (each line from f.readlines()) and not the full joined text, but the results are far worse. The model works a lot better when predicting on the whole text.

Does anyone know how I could keep or retrieve the structure of the original text? Thanks.

Asked By: claudia

||

Answers:

The groups have a start and end index that tell you which part of the input string each label corresponds to. I.e., you can pass the text as a whole, with the newlines intact (ner_pipeline(f.read(), ...)) and subsequently replace substrings.
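To make the role of those indices concrete, here is a small sketch that does not load any model. The `groups` list is hand-written stand-in data (an assumption for illustration) that mimics the dictionaries an aggregated token-classification pipeline returns; the point is only that `start` and `end` are character offsets into the original string, newlines included.

```python
# Input text with its newline left intact.
text = "Alice flew to Paris.\nShe met Bob there."

# Hypothetical, hand-written stand-in for the pipeline output
# (real output also carries a "score" key).
groups = [
    {"entity_group": "PER", "word": "Alice", "start": 0, "end": 5},
    {"entity_group": "LOC", "word": "Paris", "start": 14, "end": 19},
    {"entity_group": "PER", "word": "Bob", "start": 29, "end": 32},
]

# Slicing the original text with start/end recovers each entity,
# even for spans that come after the newline.
spans = [text[g["start"]:g["end"]] for g in groups]
for g, span in zip(groups, spans):
    print(g["entity_group"], repr(span))
```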

Here’s a working, minimal reproducible example. The only thing to note here is that we replace from right to left (result[::-1]) so we don’t mess up the indices of subsequent labels by changing the length of the string when replacing.

from nltk.corpus import brown # example data; needs nltk.download('brown') once
from transformers import pipeline

ner_pipeline =  pipeline('token-classification')

# equivalent to f.read(): one sentence per line, newlines kept
text = '\n'.join(' '.join(sent) for sent in brown.sents()[:100])

result = ner_pipeline(text, aggregation_strategy='first')

def replace_at(label, start, end, txt):
    """Replace substring of txt from start to end with label"""
    return ''.join((txt[:start], label, txt[end:]))

# Substitution
for group in result[::-1]:
    ent = group["entity_group"]
    if ent != 'ORG': # for testing since there's no 'O' in the default model
        text = replace_at(ent, group['start'], group['end'], text)

sentences = text.split('\n')

Example input/output (first line):

"The Fulton County Grand Jury said Friday an investigation of Atlanta's recent primary election produced `` no evidence '' that any irregularities took place ."

After processing:

"The Fulton County Grand Jury said Friday an investigation of LOC's recent primary election produced `` no evidence '' that any irregularities took place ."
                                                              ^^^
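
For the setup in the question, where the model emits an 'O' label because of ignore_labels=[], the same idea might be wrapped in a small helper. This is only a sketch: `mask_entities` and its `skip_labels` parameter are hypothetical names, and the `groups` list is hand-written stand-in data rather than real model output, so the index logic can be checked without downloading anything.

```python
def mask_entities(text, groups, skip_labels=("O",)):
    """Replace each labelled span in text with its label, right to left."""
    # Process spans from the highest start offset down, so that a replacement
    # never shifts the offsets of spans that are still to be processed.
    for g in sorted(groups, key=lambda g: g["start"], reverse=True):
        if g["entity_group"] in skip_labels:
            continue  # leave non-entity text untouched
        text = text[:g["start"]] + g["entity_group"] + text[g["end"]:]
    return text


text = "Alice flew to Paris.\nShe met Bob there."

# Hypothetical stand-in for pipeline output, including one 'O' group.
groups = [
    {"entity_group": "PER", "word": "Alice", "start": 0, "end": 5},
    {"entity_group": "O", "word": "flew to", "start": 6, "end": 13},
    {"entity_group": "LOC", "word": "Paris", "start": 14, "end": 19},
    {"entity_group": "PER", "word": "Bob", "start": 29, "end": 32},
]

masked = mask_entities(text, groups)
lines = masked.split("\n")  # the original line structure survives
print(lines)
```

Because only the labelled spans are touched, whitespace and newlines between them pass through unchanged, which is what keeps the structure of the input.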
Answered By: fsimonjetz