scispaCy for biomedical named entity recognition (NER)
Question:
How to label entities using scispacy?
When I tried to perform NER using scispacy, it identified the biomedical entities by labeling them as Entity but failed to label them as gene/protein, etc. So how do I do that using scispacy? Or is scispacy not capable of labeling data? The image is attached for reference:
jupyter notebook snippet
Answers:
The models en_core_sci_sm, en_core_sci_md and en_core_sci_lg do not assign specific entity types: every entity they detect gets the same generic label. If you want typed labels, use one of these models:
- en_ner_craft_md
- en_ner_jnlpba_md
- en_ner_bc5cdr_md
- en_ner_bionlp13cg_md
Each of these has its own set of entity types; see https://allenai.github.io/scispacy/ for more information.
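To see which of these models are available locally and which labels each one emits, a sketch like the following may help. It assumes each model was installed as a pip package under its own name; `installed` and `ner_labels_by_model` are hypothetical helper names, not part of scispaCy.

```python
import importlib.util

# The four labeled NER models named above.
MODEL_NAMES = [
    "en_ner_craft_md",
    "en_ner_jnlpba_md",
    "en_ner_bc5cdr_md",
    "en_ner_bionlp13cg_md",
]

def installed(names):
    """Return the model packages that can be imported in this environment."""
    return [n for n in names if importlib.util.find_spec(n) is not None]

def ner_labels_by_model(names):
    """Load each model and return {model name: sorted NER labels}."""
    import spacy  # imported lazily so the install check works without spaCy
    return {n: sorted(spacy.load(n).get_pipe("ner").labels) for n in names}

available = installed(MODEL_NAMES)
if available:
    for name, labels in ner_labels_by_model(available).items():
        print(name, "->", ", ".join(labels))
else:
    print("None of the labeled NER models are installed.")
```

Reading the labels from the loaded pipeline avoids relying on a hand-copied list that may drift between model releases.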
You can filter on the label 'GENE_OR_GENE_PRODUCT' to get all gene names.
import spacy
import scispacy
import en_ner_bionlp13cg_md
document = "We aimed to prospectively compare the risk of early progression according to circulating ESR1 mutations, CA-15.3, and circulating cell-free DNA in MBC patients treated with a first-line aromatase inhibitor (AI)"
nlp = spacy.load("en_ner_bionlp13cg_md")
# Keep only the entities tagged as genes or gene products
for ent in nlp(document).ents:
    if ent.label_ == 'GENE_OR_GENE_PRODUCT':
        print(ent.text)
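The same filter can be wrapped in a small function and applied to several texts at once. This is a sketch rather than part of the original answer: `entities_with_label` is a hypothetical helper, and it relies on spaCy's `nlp.pipe`, which streams texts through the pipeline in batches.

```python
def entities_with_label(nlp, texts, label="GENE_OR_GENE_PRODUCT"):
    """Return, for each text, the entity strings whose label equals `label`."""
    return [
        [ent.text for ent in doc.ents if ent.label_ == label]
        for doc in nlp.pipe(texts)  # processes the texts in batches
    ]

# Usage (requires the model to be installed):
#   import spacy
#   nlp = spacy.load("en_ner_bionlp13cg_md")
#   print(entities_with_label(nlp, [document]))
```

Passing the pipeline in as an argument keeps the function independent of which scispaCy model is loaded, so the same helper works with any of the labeled models and their own label names.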
Install required Modules
!pip install spacy
# !pip install https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.5.1/en_ner_bionlp13cg_md-0.5.1.tar.gz
# !pip install scispacy
Load Packages
import scispacy, spacy
sci_nlp = spacy.load("en_ner_bionlp13cg_md")
Components of the NLP object
sci_nlp.component_names
Explore entities
# Number and print every label the NER component can assign
for c, label in enumerate(sci_nlp.get_pipe('ner').labels, start=1):
    print(c, "<==>", label)
# output
1 <==> AMINO_ACID
2 <==> ANATOMICAL_SYSTEM
3 <==> CANCER
4 <==> CELL
5 <==> CELLULAR_COMPONENT
6 <==> DEVELOPING_ANATOMICAL_STRUCTURE
7 <==> GENE_OR_GENE_PRODUCT
8 <==> IMMATERIAL_ANATOMICAL_ENTITY
9 <==> MULTI_TISSUE_STRUCTURE
10 <==> ORGAN
11 <==> ORGANISM
12 <==> ORGANISM_SUBDIVISION
13 <==> ORGANISM_SUBSTANCE
14 <==> PATHOLOGICAL_FORMATION
15 <==> SIMPLE_CHEMICAL
16 <==> TISSUE
x = "Med7 — An information extraction model for clinical natural language processing. More information about the model development can be found in recent pre-print: Med7: a transferable clinical natural language processing model for electronic health records."
docx = sci_nlp(x)
for ent in docx.ents:
    print(ent.text, ent.label_)
#output
Med7 GENE_OR_GENE_PRODUCT
Med7 GENE_OR_GENE_PRODUCT
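To collect entities per label instead of printing them one by one, a small grouping helper may be useful. This is a minimal sketch: `group_by_label` is a hypothetical name, and the stand-in objects below simply mirror the two entities printed above so that the snippet runs without the model. With a real pipeline you would pass `docx.ents`.

```python
from collections import defaultdict
from types import SimpleNamespace

def group_by_label(ents):
    """Map each entity label to the list of entity texts carrying it."""
    grouped = defaultdict(list)
    for ent in ents:
        grouped[ent.label_].append(ent.text)
    return dict(grouped)

# Stand-ins for spaCy spans (assumption: only .text and .label_ are needed).
stand_in_ents = [
    SimpleNamespace(text="Med7", label_="GENE_OR_GENE_PRODUCT"),
    SimpleNamespace(text="Med7", label_="GENE_OR_GENE_PRODUCT"),
]
print(group_by_label(stand_in_ents))
```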
Visualize
from spacy import displacy
displacy.render(docx,style='ent',jupyter=True)
#output
Med7 GENE_OR_GENE_PRODUCT — An information extraction model for clinical natural language processing. More information about the model development can be found in recent pre-print: Med7 GENE_OR_GENE_PRODUCT : a transferable clinical natural language processing model for electronic health records.