How to extract cities with Spacy / Can't load French model

Question:

I know it's probably an easy question, but I'm not very familiar with spaCy.

I'm trying to extract city names from a text file.

Here is my code:

pip install spacy_lefff
pip install spacy download fr

import spacy
from spacy_lefff import LefffLemmatizer
from spacy.language import Language

@Language.factory('french_lemmatizer')
def create_french_lemmatizer(nlp, name):
    return LefffLemmatizer()

nlp = spacy.load('fr_core_news_sm')
nlp.add_pipe('french_lemmatizer', name='lefff')
doc = nlp(u"Apple cherche a acheter une startup anglaise pour 1 milliard de dollard")
for d in doc:
    print(d.text, d.pos_, d._.lefff_lemma, d.tag_, d.lemma_)

import spacy
nlp = spacy.load("en_core_web_sm")

import os
from google.colab import drive
drive.mount('/content/drive/', force_remount=True)
if not os.path.exists('/content/drive/My Drive/Miserables'):
  os.makedirs('/content/drive/My Drive/Miserables')

root_dir = '/content/drive/My Drive/Miserables/'
os.listdir('/content/drive/My Drive/Miserables')
with open("/content/drive/My Drive/Miserables/miserable.txt", 'r') as f:
     myString = f.read()

doc = nlp(open('/content/drive/My Drive/Miserables/miserable - 1.txt').read())
gpe = []
loc = []
for ent in doc.ents:
    if (ent.label_ == 'GPE'):
        gpe.append(ent.text)
    elif (ent.label_ == 'LOC'):
        loc.append(ent.text)

cities = []
countries = []
other_places = []
import wikipedia
for text in gpe:
    summary = str(wikipedia.summary(text),"html.parser")
    if ('city' in summary):
        cities.append(text)
    elif ('country' in summary):
        countries.append(text)
    else:
        other_places.append(text)

for text in loc:
    other_places.append(text)


TypeError: decoding str is not supported
I also can't load the French spaCy model, and I don't know why; whatever I try doesn't work.

Thanks for your help.

Asked By: NoobWithPython


Answers:

My text is in French

I just skimmed some of locationtagger's source code, and it appears to hardcode the en_core_web_sm model, so it likely does not parse your French text correctly.

I would not use nltk or locationtagger for this task.

Instead, download a proper spaCy model for French:

python3 -m spacy download fr_core_news_{sm|md|lg}

(Note that the French transformer pipeline is named fr_dep_news_trf, not fr_core_news_trf.)

Read spaCy’s documentation on named entity recognition [1]. This includes information about identifying geopolitical entities ("GPE").

The default English spaCy models tag cities, states/provinces/districts, and countries with the "GPE" label; the French fr_core_news models use the coarser "LOC" label for places. If you are interested only in the cities, you should filter the recognized entities against the data in locationtagger's City-Region-Locations.csv.
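A sketch of that filtering step, assuming a CSV whose first column holds city names (check the actual column layout of City-Region-Locations.csv before relying on this):

```python
import csv

def load_city_names(csv_path):
    # Assumes the first column of each row is a city name;
    # adjust the index to the file's real layout
    with open(csv_path, newline="", encoding="utf-8") as f:
        return {row[0].strip().lower() for row in csv.reader(f) if row}

def filter_cities(entities, city_names):
    # Case-insensitive membership test against the reference list
    return [e for e in entities if e.strip().lower() in city_names]
```

Applied to the strings collected from doc.ents, this keeps only those that match a known city.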

Additionally, you may wish to segment the text by paragraph and use spaCy’s nlp.pipe to process paragraphs in parallel.

Answered By: Andrew Parsons