Getting address with geopy takes too long

Question:

In the context of a project, I have hydrated 1.6 million tweets, i.e. retrieved the metadata associated with them, such as creation date and location.

My tweet dataset contains tweets from all over the world, however, I am only interested in tweets created in the US.
Also, I want to create some statistics by state, and since most of the locations associated with the tweets are wrong or not formalised, I need to formalise them before I do so.

Here are the kind of locations that I have:
['한국어 강제 수용소 (DPRK)', 'Lagos, Nigeria', 'Kolkata, India', 'Who cares', 'Unknown', 'British Columbia, Canada', 'Bitcoin & Markets', 'White Plains, NY', 'Washington, DC']

I was able to write code that filters these locations and formalises them, but it is far too slow (about 2 locations per second), which means it would take between 8 and 9 days to formalise all of them.

I am looking to speed up this process.

In the beginning, my df looked like this:

[screenshot of the dataframe before processing]

Here is the code that I used; I only tried it on a sample since the process is so slow:

from geopy import geocoders
from geopy.exc import GeocoderServiceError

geolocator = geocoders.Nominatim(user_agent='myapplication')

from tqdm.auto import tqdm
tqdm.pandas()

def get_address(x):
    # Return the formalised address, or "" if the lookup fails or finds nothing
    try:
        location = geolocator.geocode(x)
        return location.address if location else ""
    except GeocoderServiceError:
        return ""

df_s = df.sample(1000)
df_s["new_loc"] = df_s.user_location.progress_apply(get_address)

# The country is the last component of the formalised address
df_s["country"] = df_s.new_loc.apply(lambda x: x.split(",")[-1])

# Keep only US locations that resolved to more than a bare country name
df_s = df_s[df_s.country.apply(lambda x: "United States" in x)]
df_s = df_s[df_s.new_loc.apply(lambda x: len(x.split(",")) > 1)]

In the end, my df looked like this, which is what I wanted:

[screenshot of the filtered dataframe with formalised US locations]

Is there a way to do this faster?

Asked By: lifrah


Answers:

Per the docs, geopy is a client for calling various third-party services; that is, it makes network calls on your behalf to services that may be metered.

This is always going to be a very slow process if you want to make millions of API calls. It costs money to provide those services, so you have to be reasonable about your use of the free ones (making millions of requests in a few minutes would not be reasonable).

I quote:

Different services have different Terms of Use, quotas, pricing, geodatabases and so on. For example, Nominatim is free, but provides low request limits. If you need to make more queries, consider using another (probably paid) service, such as OpenMapQuest or PickPoint (these two are commercial providers of Nominatim, so they should have the same data and APIs). Or, if you are ready to wait, you can try geopy.extra.rate_limiter.

That gives you a few different approaches you could use. I would suggest checking the pricing for the paid services and seeing what rate limits they impose and what guarantees they offer.
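If you decide to stay with the free Nominatim service and simply wait it out, the rate-limiter approach mentioned in the quote looks roughly like this. This is only a sketch that reuses the geolocator, df_s and user_location names from your question; note that it throttles the requests to stay within Nominatim's one-request-per-second usage policy rather than making the overall job faster:

from geopy.geocoders import Nominatim
from geopy.extra.rate_limiter import RateLimiter

geolocator = Nominatim(user_agent='myapplication')

# Space consecutive calls at least one second apart and retry transient
# errors; failed lookups come back as None rather than raising.
geocode = RateLimiter(geolocator.geocode, min_delay_seconds=1)

def get_address(x):
    location = geocode(x)
    return location.address if location else ""

df_s["new_loc"] = df_s.user_location.progress_apply(get_address)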

Even then, API calls are always going to be slow-ish as they traverse the internet. You may also need to adopt some parallelization, depending on your requirements.
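For example, if you move to a provider whose terms of use allow concurrent requests, a sketch of fanning the lookups out over a small thread pool might look like the following (again reusing geolocator and df_s from your question; the worker count of 4 is an arbitrary assumption you would tune against the provider's limits):

from concurrent.futures import ThreadPoolExecutor
from geopy.exc import GeocoderServiceError

def get_address(x):
    # Same lookup as before: return the formalised address, or "" on failure
    try:
        location = geolocator.geocode(x)
        return location.address if location else ""
    except GeocoderServiceError:
        return ""

# The calls are I/O-bound, so threads help despite the GIL. Do NOT do this
# against the public Nominatim server, whose usage policy forbids bulk or
# concurrent requests.
with ThreadPoolExecutor(max_workers=4) as executor:
    df_s["new_loc"] = list(executor.map(get_address, df_s.user_location))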

Answered By: Paddy Alton