How to normalise keywords extracted with Named Entity Recognition

Question:

I’m trying to employ NER to extract keywords (tags) from job postings. These can be anything along the lines of React, AWS, Team Building, Marketing.

After training a custom model in SpaCy I’m presented with a problem – extracted tags are not unified/normalized across all of the data.

For example, if a job posting is about frontend development, NER can extract the keyword frontend in many forms (depending on the job description), for example: Frontend, Front End, Front-End, front-end and so on.

Is there a reliable way to normalise/unify the extracted keywords? All the keywords go directly into the database and, with all the variants of each keyword, I would end up with too much noise.

One way to tackle the problem would be to create mappings such as:

"Frontend": ["Front End", "Front-End", "front-end"]

but that approach does not seem very smart. Perhaps spaCy itself has an option to normalise tags?

Asked By: Pono


Answers:

Certainly, these simple rules, applied to a string s, can quickly help you collapse similar strings:

  • s.lower()
  • s.replace("-", " ")
  • s.replace(" ", "")
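The three rules above can be chained into a single key function. This is a minimal sketch, not part of spaCy: the names normalise_tag and group_variants are hypothetical helpers introduced only for illustration.

```python
def normalise_tag(tag):
    """Collapse case, hyphen and spacing variants into one lookup key."""
    key = tag.lower()            # "Front-End" -> "front-end"
    key = key.replace("-", " ")  # "front-end" -> "front end"
    key = key.replace(" ", "")   # "front end" -> "frontend"
    return key


def group_variants(tags):
    """Group the raw tags that share the same normalised key."""
    groups = {}
    for tag in tags:
        groups.setdefault(normalise_tag(tag), []).append(tag)
    return groups


variants = group_variants(["Frontend", "Front End", "Front-End", "front-end", "AWS"])
print(variants)
```

Storing the normalised key in the database, and keeping one of the raw variants as the display label, avoids having to maintain the mapping by hand.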

There are several phonetic algorithms, such as Metaphone, that are good at
collapsing "sounds alike" variants into a single base entity.
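To illustrate the idea, here is a minimal sketch of American Soundex, a simpler phonetic algorithm than Metaphone (swapped in here because it fits in a few lines of plain Python). The function name soundex and the sample words are assumptions made for this example; for real use, a maintained phonetic library is likely a better choice.

```python
# Consonant groups that share a Soundex digit.
_CODES = {}
for _letters, _digit in (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                         ("l", "4"), ("mn", "5"), ("r", "6")):
    for _ch in _letters:
        _CODES[_ch] = _digit


def soundex(word):
    """Return the 4-character American Soundex code of a word."""
    # Keep letters only, so spaces and hyphens are ignored.
    letters = [c for c in word.lower() if c.isalpha()]
    if not letters:
        return ""
    result = letters[0].upper()
    prev = _CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "hw":
            continue  # h and w do not separate letters with equal codes
        code = _CODES.get(ch, "")  # vowels (and y) have no code
        if code and code != prev:
            result += code
        prev = code
    return (result + "000")[:4]  # pad or truncate to four characters


print(soundex("Kubernetes"), soundex("Kubernettes"))
print(soundex("Front-End"), soundex("Frontend"))
```

Phonetic keys are coarse, so unrelated tags can collide. They are probably best used to propose merge candidates for review rather than to merge automatically.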

A frequent bi-gram analysis may help you to identify
common two-word phrases that denote a single entity.
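A bi-gram frequency count can be sketched with the standard library alone. The function name frequent_bigrams, the min_count threshold and the sample postings are assumptions for illustration, and the whitespace tokenisation is deliberately naive.

```python
from collections import Counter


def frequent_bigrams(texts, min_count=2):
    """Count adjacent word pairs and keep those seen at least min_count times."""
    counts = Counter()
    for text in texts:
        # Naive tokenisation: lowercase, treat hyphens as spaces, split on whitespace.
        tokens = text.lower().replace("-", " ").split()
        counts.update(zip(tokens, tokens[1:]))
    return [(bigram, n) for bigram, n in counts.most_common() if n >= min_count]


postings = [
    "Senior Front End developer",
    "front-end engineer wanted",
    "Team building and front end work",
]
print(frequent_bigrams(postings))
```

Pairs that occur often, such as ("front", "end") here, are candidates to be joined into a single tag before the other normalisation steps run.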

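spaCy’s token.lemma_ (the base form of a token, as opposed to its surface form in token.text) can help with lemmatisation, which collapses inflected variants such as plurals.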

Learning that e.g. "React" and "Frontend" are more or less synonyms
in this context would require a heavier-weight approach, such as word2vec,
WordNet, or an LLM like ChatGPT.

Answered By: J_H

To supplement J_H’s great answer: finding related terms like "React" and "frontend" can be done with spaCy out of the box, using the word vectors that ship with its larger models. E.g., let’s find all the named entities in the opening paragraphs of the Wikipedia entry for Charles Dickens and cluster them.

$ python -m spacy download en_core_web_lg  # 600 MiB, only need to do this once

import spacy
import numpy as np
import pandas as pd
from sklearn.cluster import DBSCAN


nlp = spacy.load("en_core_web_lg")
paragraph = """
Charles John Huffam Dickens (/ˈdɪkɪnz/; 7 February 1812 – 9 June 1870) was an English writer and social critic. He created some of the world's best-known fictional characters and is regarded by many as the greatest novelist of the Victorian era.[1] His works enjoyed unprecedented popularity during his lifetime and, by the 20th century, critics and scholars had recognised him as a literary genius. His novels and short stories are widely read today.[2][3]

Born in Portsmouth, Dickens left school at the age of 12 to work in a boot-blacking factory when his father was incarcerated in a debtors' prison. After three years he returned to school, before he began his literary career as a journalist. Dickens edited a weekly journal for 20 years, wrote 15 novels, five novellas, hundreds of short stories and non-fiction articles, lectured and performed readings extensively, was an indefatigable letter writer, and campaigned vigorously for children's rights, for education, and for other social reforms.

Dickens's literary success began with the 1836 serial publication of The Pickwick Papers, a publishing phenomenon—thanks largely to the introduction of the character Sam Weller in the fourth episode—that sparked Pickwick merchandise and spin-offs. Within a few years Dickens had become an international literary celebrity, famous for his humour, satire and keen observation of character and society. His novels, most of them published in monthly or weekly installments, pioneered the serial publication of narrative fiction, which became the dominant Victorian mode for novel publication.[4][5] Cliffhanger endings in his serial publications kept readers in suspense.[6] The instalment format allowed Dickens to evaluate his audience's reaction, and he often modified his plot and character development based on such feedback.[5] For example, when his wife's chiropodist expressed distress at the way Miss Mowcher in David Copperfield seemed to reflect her own disabilities, Dickens improved the character with positive features.[7] His plots were carefully constructed and he often wove elements from topical events into his narratives.[8] Masses of the illiterate poor would individually pay a halfpenny to have each new monthly episode read to them, opening up and inspiring a new class of readers.[9]

His 1843 novella A Christmas Carol remains especially popular and continues to inspire adaptations in every artistic genre. Oliver Twist and Great Expectations are also frequently adapted and, like many of his novels, evoke images of early Victorian London. His 1859 novel A Tale of Two Cities (set in London and Paris) is his best-known work of historical fiction. The most famous celebrity of his era, he undertook, in response to public demand, a series of public reading tours in the later part of his career.[10] The term Dickensian is used to describe something that is reminiscent of Dickens and his writings, such as poor social or working conditions, or comically repulsive characters."""

doc = nlp(paragraph)
df = pd.DataFrame([(e.text, e.label_, np.array(e.vector)) for e in doc.ents], columns=['text', 'type', 'vec'])
X = np.vstack(df.vec.to_numpy())
dbscan = DBSCAN(metric='cosine', min_samples=1, eps=0.4)
df['cluster'] = dbscan.fit_predict(X)

Finally, let’s display the clusters:

# Each group is a (cluster id, Series of entity texts) pair
groups = df.groupby('cluster')['text']
for _, texts in groups:
    print(texts.values)

Resulting in

['Charles John Huffam Dickens' 'Dickens' 'Dickens' 'Dickens' 'Dickens'
 'Dickens' 'David Copperfield' 'Dickens' 'Dickens']
['7 February 1812']
['English']
['the 20th century']
['Portsmouth']
['the age of 12' 'three years' '20 years' '15' 'five' 'fourth'
 'a few years' 'A Tale of Two Cities']
['weekly' 'monthly' 'weekly' 'monthly']
['hundreds']
['1836' '1843' '1859']
['The Pickwick Papers' 'Pickwick']
['Sam Weller']
['Victorian' 'early Victorian London' 'London' 'Paris']
['Mowcher']
['a halfpenny']
['A Christmas Carol']
['Oliver Twist']

Answered By: dimid