Scispacy for biomedical named entity recognition (NER)

Question:

How to label entities using scispacy?

When I tried to perform NER using scispacy, it identified the biomedical entities but labeled all of them as Entity instead of as gene, protein, etc. How do I get typed labels using scispacy? Or is scispacy not capable of labeling entities by type? The image is attached for reference:
jupyter notebook snippet

Asked By: ishas


Answers:

The models en_core_sci_sm, en_core_sci_md and en_core_sci_lg detect entity spans but give every span the same generic label, ENTITY. If you want typed labels, use one of the specialized NER models:

  • en_ner_craft_md
  • en_ner_jnlpba_md
  • en_ner_bc5cdr_md
  • en_ner_bionlp13cg_md

Each of these models is trained on a different corpus and has its own set of entity types. See https://allenai.github.io/scispacy/ for more information.
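To see which entity types each of these models offers, one option is to load whichever ones are installed and print the labels of their ner component. This is a minimal sketch, assuming the models were installed as pip packages under the names listed above; the helper names installed_models and labels_for are illustrative and not part of scispacy.

```python
import importlib.util

MODEL_NAMES = [
    "en_ner_craft_md",
    "en_ner_jnlpba_md",
    "en_ner_bc5cdr_md",
    "en_ner_bionlp13cg_md",
]


def installed_models(names):
    """Return the names that can be imported as top-level Python packages."""
    return [name for name in names if importlib.util.find_spec(name) is not None]


def labels_for(model_name):
    """Load one model and return the sorted labels of its "ner" component."""
    import spacy  # imported lazily so installed_models() works without spaCy

    nlp = spacy.load(model_name)
    return sorted(nlp.get_pipe("ner").labels)


available = installed_models(MODEL_NAMES)
if not available:
    print("None of the scispacy NER models are installed.")
for name in available:
    print(name, "->", labels_for(name))
```

Comparing the printed label sets is a quick way to decide which model fits a task before running it on real text.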

Answered By: emptyMug

Load en_ner_bionlp13cg_md and keep only the entities whose label is 'GENE_OR_GENE_PRODUCT' to get all gene names.

import spacy
import scispacy

document = "We aimed to prospectively compare the risk of early progression according to circulating ESR1 mutations, CA-15.3, and circulating cell-free DNA in MBC patients treated with a first-line aromatase inhibitor (AI)"

# Load the BioNLP13CG model, whose labels include GENE_OR_GENE_PRODUCT
nlp = spacy.load("en_ner_bionlp13cg_md")
for ent in nlp(document).ents:
    # Keep only gene / gene-product mentions
    if ent.label_ == "GENE_OR_GENE_PRODUCT":
        print(ent.text)
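If more than one entity type is of interest, the same filtering idea can be generalized by grouping entity texts under their labels. The following is a rough sketch, not part of scispacy: entities_by_label is a hypothetical helper, and the FakeEnt objects are made-up stand-ins (not real model output) that only mimic the .text and .label_ attributes of spaCy entities so the example runs without a model.

```python
from collections import namedtuple


def entities_by_label(ents, wanted=None):
    """Group unique entity texts by label, optionally keeping only `wanted` labels."""
    grouped = {}
    for ent in ents:
        if wanted is not None and ent.label_ not in wanted:
            continue
        texts = grouped.setdefault(ent.label_, [])
        if ent.text not in texts:  # skip repeated mentions
            texts.append(ent.text)
    return grouped


# Stand-in objects with the same .text / .label_ attributes as spaCy entities.
# These values are invented for the demo, not real model output.
FakeEnt = namedtuple("FakeEnt", ["text", "label_"])
demo_ents = [
    FakeEnt("ESR1", "GENE_OR_GENE_PRODUCT"),
    FakeEnt("aromatase", "GENE_OR_GENE_PRODUCT"),
    FakeEnt("ESR1", "GENE_OR_GENE_PRODUCT"),
    FakeEnt("patients", "ORGANISM"),
]
genes = entities_by_label(demo_ents, wanted={"GENE_OR_GENE_PRODUCT"})
print(genes)  # {'GENE_OR_GENE_PRODUCT': ['ESR1', 'aromatase']}
```

With a loaded pipeline, the same function should accept nlp(document).ents directly, since it only reads .text and .label_.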
Answered By: gjkkhfdsd fjkvfhjk

Install required modules

!pip install spacy
# Uncomment on first run to install scispacy and the model:
# !pip install scispacy
# !pip install https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.5.1/en_ner_bionlp13cg_md-0.5.1.tar.gz
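Before loading the model it can help to confirm what is actually installed, since a model built for one release (v0.5.1 in the URL above) may warn or fail when paired with mismatched scispacy or spaCy versions. A small sketch using only the standard library; installed_version is a hypothetical helper name, and it assumes the model's distribution name matches its package name.

```python
from importlib import metadata

PACKAGES = ["spacy", "scispacy", "en_ner_bionlp13cg_md"]


def installed_version(package):
    """Return the installed version string, or None if the package is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


for name in PACKAGES:
    print(f"{name}: {installed_version(name) or 'not installed'}")
```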

Load packages

import scispacy, spacy
sci_nlp = spacy.load("en_ner_bionlp13cg_md")

Components of the NLP object

sci_nlp.component_names

Explore entities

# List the entity labels this model can predict, numbered from 1
for i, label in enumerate(sci_nlp.get_pipe('ner').labels, start=1):
    print(i, "<==>", label)
# output
1 <==> AMINO_ACID
2 <==> ANATOMICAL_SYSTEM
3 <==> CANCER
4 <==> CELL
5 <==> CELLULAR_COMPONENT
6 <==> DEVELOPING_ANATOMICAL_STRUCTURE
7 <==> GENE_OR_GENE_PRODUCT
8 <==> IMMATERIAL_ANATOMICAL_ENTITY
9 <==> MULTI_TISSUE_STRUCTURE
10 <==> ORGAN
11 <==> ORGANISM
12 <==> ORGANISM_SUBDIVISION
13 <==> ORGANISM_SUBSTANCE
14 <==> PATHOLOGICAL_FORMATION
15 <==> SIMPLE_CHEMICAL
16 <==> TISSUE


x = "Med7 — An information extraction model for clinical natural language processing. More information about the model development can be found in recent pre-print: Med7: a transferable clinical natural language processing model for electronic health records."

docx = sci_nlp(x)
for ent in docx.ents:
    print(ent.text, ent.label_)
# output (in this text Med7 is the name of an NLP model, so the gene label is a false positive)
Med7 GENE_OR_GENE_PRODUCT
Med7 GENE_OR_GENE_PRODUCT

Visualize

from spacy import displacy
displacy.render(docx,style='ent',jupyter=True)

#output
Med7 GENE_OR_GENE_PRODUCT — An information extraction model for clinical natural language processing. More information about the model development can be found in recent pre-print: Med7 GENE_OR_GENE_PRODUCT : a transferable clinical natural language processing model for electronic health records.
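With 16 possible labels the rendered output can get busy. displaCy's entity visualizer accepts an options dict with "ents" (the labels to highlight) and "colors" (a label-to-color mapping), so the display can be limited to the types of interest. A minimal sketch; ent_display_options is a hypothetical helper and the color value is an arbitrary choice.

```python
def ent_display_options(labels, color="#ffd966"):
    """Build a displaCy options dict that shows only `labels`, all in one color."""
    labels = list(labels)
    return {"ents": labels, "colors": {label: color for label in labels}}


options = ent_display_options(["GENE_OR_GENE_PRODUCT", "CANCER"])
print(options)

# In a notebook, with `docx` being a processed Doc as in the answer above:
# from spacy import displacy
# displacy.render(docx, style="ent", jupyter=True, options=options)
```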
Answered By: thrinadhn