How to extract cities with Spacy / Can't load French model
Question:
I know it’s perhaps an easy question, but I’m not very familiar with spaCy. I’m trying to extract cities from a text file. Here is my code:
    pip install spacy_lefff
    pip install spacy download fr

    import spacy
    from spacy_lefff import LefffLemmatizer
    from spacy.language import Language

    @Language.factory('french_lemmatizer')
    def create_french_lemmatizer(nlp, name):
        return LefffLemmatizer()

    nlp = spacy.load('fr_core_news_sm')
    nlp.add_pipe('french_lemmatizer', name='lefff')
    doc = nlp(u"Apple cherche a acheter une startup anglaise pour 1 milliard de dollard")
    for d in doc:
        print(d.text, d.pos_, d._.lefff_lemma, d.tag_, d.lemma_)
    import spacy
    nlp = spacy.load("en_core_web_sm")

    import os
    from google.colab import drive
    drive.mount('/content/drive/', force_remount=True)
    if not os.path.exists('/content/drive/My Drive/Miserables'):
        os.makedirs('/content/drive/My Drive/Miserables')
    root_dir = '/content/drive/My Drive/Miserables/'
    os.listdir('/content/drive/My Drive/Miserables')

    with open("/content/drive/My Drive/Miserables/miserable.txt", 'r') as f:
        myString = f.read()
    doc = nlp(open('/content/drive/My Drive/Miserables/miserable - 1.txt').read())

    gpe = []
    loc = []
    for ent in doc.ents:
        if ent.label_ == 'GPE':
            gpe.append(ent.text)
        elif ent.label_ == 'LOC':
            loc.append(ent.text)
    cities = []
    countries = []
    other_places = []

    import wikipedia
    for text in gpe:
        summary = str(wikipedia.summary(text), "html.parser")
        if 'city' in summary:
            cities.append(text)
        elif 'country' in summary:
            countries.append(text)
        else:
            other_places.append(text)

    for text in loc:
        other_places.append(text)

This raises:

    TypeError: decoding str is not supported
I can’t load the French spaCy model, and I don’t know why; I’ve tried, but it doesn’t work. Thanks for your help.
Answers:
You mention that your text is in French.
I just skimmed some of the source code for locationtagger, and it appears that it hardcodes usage of the en_core_web_sm model. It likely does not form correct parses of your input text. I would not use nltk or locationtagger for this task.
Instead, download a proper spaCy model for French:

    python3 -m spacy download fr_core_news_{sm|md|lg}
Read spaCy’s documentation on named entity recognition [1], which includes information about identifying geopolitical entities ("GPE"). The default English spaCy models tag cities, states/provinces/districts, and countries under the "GPE" label; note that the French fr_core_news models use the coarser "LOC" label instead. If you are interested only in cities, filter the found entities against the data in locationtagger’s City-Region-Locations.csv.
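A minimal sketch of that filtering step, assuming the CSV has a city column (the toy rows below stand in for locationtagger’s actual data, whose exact column names may differ):

```python
import csv
import io

# Toy stand-in for locationtagger's City-Region-Locations.csv; adjust the
# field name to match the real file's header.
csv_text = """city,region,country
Paris,Ile-de-France,France
Lyon,Auvergne-Rhone-Alpes,France
Toulon,Provence-Alpes-Cote d'Azur,France
"""

known_cities = {row["city"].lower() for row in csv.DictReader(io.StringIO(csv_text))}

# Entities spaCy tagged as geopolitical/location (cities, countries, regions mixed):
gpe = ["Paris", "France", "Toulon", "Normandie"]

cities = [name for name in gpe if name.lower() in known_cities]
print(cities)  # ['Paris', 'Toulon']
```

With the real CSV, replace the inline string with `open(...)` and keep the same set-membership check.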
Additionally, you may wish to segment the text by paragraph and use spaCy’s nlp.pipe to process the paragraphs in batches (and, with its n_process argument, in parallel).
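As an aside, the TypeError: decoding str is not supported in your traceback comes from the line str(wikipedia.summary(text), "html.parser"): the two-argument form of str() decodes bytes, and wikipedia.summary() already returns a str. A minimal reproduction of the failure and the fix, using a plain string in place of the Wikipedia call:

```python
# Stand-in for the str returned by wikipedia.summary(text):
summary = "Paris is the capital and largest city of France."

# The two-argument form of str() tries to *decode bytes*; passing a str
# raises the exact error from the traceback.
try:
    str(summary, "html.parser")
except TypeError as e:
    print(e)  # decoding str is not supported

# The fix: use the summary string directly, no str(...) wrapper needed.
if 'city' in summary:
    print("tagged as city")
```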