Most straightforward way to cache geocoding data

Question:

I am using geopy to get lat/long coordinates for a list of addresses. All the documentation points to limiting server queries by caching (many questions here, in fact), but few actually give practical solutions.

What is the best way to accomplish this?

This is for a self-contained data processing job I’m working on … no app platform involved. Just trying to cut down on server queries as I run through data that I will have seen before (very likely, in my case).

My code looks like this:

from geopy import geocoders
def geocode( address ):
    # address ~= "175 5th Avenue NYC"
    g = geocoders.GoogleV3()
    cache = addressCached( address )

    if ( cache != False ): 
        # We have seen this exact address before,
        # return the saved location
        return cache

    # Otherwise, get a new location from geocoder
    location = g.geocode( address )

    saveToCache( address, location )
    return location

def addressCached( address ):
    # What does this look like?

def saveToCache( address, location ):
    # What does this look like?
Asked By: cslstr


Answers:

How exactly you want to implement your cache is really dependent on what platform your Python code will be running on.

You want a pretty persistent “cache”, since addresses’ locations are not going to change often :-), so a database (used in a key-value mode) seems best.

So in many cases I’d pick sqlite3, an excellent, very lightweight SQL engine that’s part of the Python standard library. An exception might be if I preferred, e.g., a MySQL instance that I need to have running anyway; one advantage of that would be allowing multiple applications running on different nodes to share the “cache”. Other DBs, both SQL and non-SQL, would also work for the latter, depending on your constraints and preferences.

But if I was, e.g., running on Google App Engine, then I’d be using the datastore it includes instead — unless I had specific reasons to want to share the “cache” among multiple disparate applications, in which case I might consider alternatives such as Google Cloud SQL and Google Storage, as well as yet another alternative: a dedicated “cache server” GAE app of my own serving RESTful results (maybe with Endpoints?). The choice is, again!, very, very dependent on your constraints and preferences (latency, queries-per-second sizing, etc, etc).

So please clarify what platform you are on, and what other constraints and preferences you have for your database-like “cache”; then the very simple code to implement it can easily be shown. Showing half a dozen different possibilities before you clarify would not be very productive.

Added: since the comments suggest sqlite3 may be acceptable, and there are a few important details best shown in code (such as, how to serialize and deserialize an instance of geopy.location.Location into/from a sqlite3 blob — similar issues may well arise with other underlying databases, and the solutions are similar), I decided a solution example may be best shown in code. So, as the “geo cache” is clearly best implemented as its own module, I wrote the following simple geocache.py…:

import geopy
import pickle
import sqlite3

class Cache(object):
    def __init__(self, fn='cache.db'):
        self.conn = conn = sqlite3.connect(fn)
        cur = conn.cursor()
        cur.execute('CREATE TABLE IF NOT EXISTS '
                    'Geo ( '
                    'address TEXT PRIMARY KEY, '
                    'location BLOB '
                    ')')
        conn.commit()

    def address_cached(self, address):
        cur = self.conn.cursor()
        cur.execute('SELECT location FROM Geo WHERE address=?', (address,))
        res = cur.fetchone()
        if res is None: return False
        return pickle.loads(res[0])

    def save_to_cache(self, address, location):
        cur = self.conn.cursor()
        cur.execute('INSERT OR REPLACE INTO Geo(address, location) '
                    'VALUES(?, ?)',
                    (address, sqlite3.Binary(pickle.dumps(location, -1))))
        self.conn.commit()


if __name__ == '__main__':
    # run a small test in this case
    import pprint

    cache = Cache('test.db')
    address = '1 Murphy St, Sunnyvale, CA'
    location = cache.address_cached(address)
    if location:
        print('was cached: {}\n{}'.format(location, pprint.pformat(location.raw)))
    else:
        print('was not cached, looking up and caching now')
        g = geopy.geocoders.GoogleV3()
        location = g.geocode(address)
        print('found as: {}\n{}'.format(location, pprint.pformat(location.raw)))
        cache.save_to_cache(address, location)
        print('... and now cached.')

I hope the ideas illustrated here are clear enough — there are alternatives on each design choice, but I’ve tried to keep things simple (in particular, I’m using a simple example-cum-mini-test when this module is run directly, in lieu of a proper suite of unit-tests…).

For the bit about serializing to/from blobs, I’ve chosen pickle with the highest protocol (-1). cPickle would of course be just as good in Python 2 (and faster :-), but these days I try to write code that runs equally well under Python 2 and 3, unless I have specific reasons to do otherwise :-). And of course I’m using a different filename, test.db, for the sqlite database used in the test, so you can wipe it out with no qualms while testing variations, leaving the default filename meant for “production” code intact. (It is quite a dubious design choice to use a relative filename — meaning “in the current directory” — but the appropriate way to decide where to place such a file is quite platform-dependent, and I didn’t want to get into such esoterica here :-).)
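To make the serialization step concrete on its own, here is a minimal round-trip sketch using only the standard library. A plain dict stands in for a geopy.location.Location instance (so the example runs without geopy installed); the pickling logic is identical for any picklable object:

```python
import pickle
import sqlite3

# In-memory database, just for this demonstration.
conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE Geo (address TEXT PRIMARY KEY, location BLOB)')

# Stand-in for a geopy.location.Location instance.
fake_location = {'latitude': 40.741, 'longitude': -73.989,
                 'raw': {'formatted': '175 5th Avenue NYC'}}

# Serialize with the highest pickle protocol (-1) into a BLOB column.
blob = sqlite3.Binary(pickle.dumps(fake_location, -1))
conn.execute('INSERT INTO Geo VALUES (?, ?)', ('175 5th Avenue NYC', blob))
conn.commit()

# Read the row back and deserialize the blob.
row = conn.execute('SELECT location FROM Geo WHERE address=?',
                   ('175 5th Avenue NYC',)).fetchone()
restored = pickle.loads(row[0])
assert restored == fake_location
print(restored['latitude'])  # 40.741
```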

If any other question is left, please ask (perhaps best on a separate new question since this answer has already grown so big!-).

Answered By: Alex Martelli

How about creating a dict in which all geocoded addresses are stored? Then you could simply check:

if address in cache:
    # skip the server query, use the cached location
Answered By: Klaster

This cache lives only from the moment the module is loaded until you are done using it. You’ll probably want to save it to a file with pickle (or into a database) and load it again the next time you import the module.

from geopy import geocoders
cache = {}

def geocode( address ):
    # address ~= "175 5th Avenue NYC"
    g = geocoders.GoogleV3()
    cached = addressCached( address )

    if cached is not None:
        # We have seen this exact address before,
        # return the saved location
        return cached

    # Otherwise, get a new location from geocoder
    location = g.geocode( address )

    saveToCache( address, location )
    return location

def addressCached( address ):
    global cache
    if address in cache:
        return cache[address]
    return None

def saveToCache( address, location ):
    global cache
    cache[address] = location
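As the note above suggests, the in-memory dict can be persisted across runs with pickle. Here is a minimal sketch; the filename geocache.pkl is an arbitrary choice for this example:

```python
import os
import pickle

CACHE_FILE = 'geocache.pkl'  # arbitrary filename for this sketch

def load_cache():
    # Return the previously saved dict, or an empty one on first run.
    if os.path.exists(CACHE_FILE):
        with open(CACHE_FILE, 'rb') as f:
            return pickle.load(f)
    return {}

def save_cache(cache):
    # Rewrite the whole dict; fine for modest cache sizes.
    with open(CACHE_FILE, 'wb') as f:
        pickle.dump(cache, f, -1)

cache = load_cache()
cache['175 5th Avenue NYC'] = (40.741, -73.989)
save_cache(cache)
assert load_cache()['175 5th Avenue NYC'] == (40.741, -73.989)
```

Loading at import time and saving after each update (or at exit) keeps the rest of the answer's code unchanged.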
Answered By: bpgergo

Here is a simple implementation that uses the python shelve package for transparent and persistent caching:

import geopy
import shelve
import time

class CachedGeocoder:
    def __init__(self, source="Nominatim", geocache="geocache.db"):
        self.geocoder = getattr(geopy.geocoders, source)()
        self.db = shelve.open(geocache, writeback=True)
        # Start the timestamp in the past so the first request is not delayed.
        self.ts = time.time() - 1.1
    def geocode(self, address):
        if address not in self.db:
            # Throttle to at most one request per second.
            time.sleep(max(1 - (time.time() - self.ts), 0))
            self.ts = time.time()
            self.db[address] = self.geocoder.geocode(address)
        return self.db[address]

geocoder = CachedGeocoder()
print(geocoder.geocode("San Francisco, USA"))

It stores a timestamp to ensure that requests are not issued more frequently than once per second (a requirement for Nominatim). One weakness is that it doesn’t deal with timed-out responses from Nominatim.
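On that weakness: a common approach is to retry the request a few times before giving up. Here is a minimal, generic retry sketch; it is shown with a stand-in function so it runs without geopy, but with geopy you would pass exceptions=(geopy.exc.GeocoderTimedOut,):

```python
import time

def retry(func, attempts=3, delay=1.0, exceptions=(Exception,)):
    # Call func(); on one of `exceptions`, wait `delay` seconds and
    # try again, re-raising after the final attempt.
    for attempt in range(attempts):
        try:
            return func()
        except exceptions:
            if attempt == attempts - 1:
                raise
            time.sleep(delay)

# Demonstration with a stand-in for a flaky geocoder call that
# fails twice before succeeding.
calls = {'n': 0}
def flaky_geocode():
    calls['n'] += 1
    if calls['n'] < 3:
        raise TimeoutError('simulated timeout')
    return (37.77, -122.42)

result = retry(flaky_geocode, attempts=5, delay=0.01,
               exceptions=(TimeoutError,))
print(result)  # (37.77, -122.42)
```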

Answered By: Stefan Farestam

The easiest way to cache geocoding requests is probably to use requests-cache:

import geocoder
import requests_cache

requests_cache.install_cache('geocoder_cache')

g = geocoder.osm('Mountain View, CA') # <-- This request will go to the OpenStreetMap server.
print(g.latlng)
# [37.3893889, -122.0832101]

g = geocoder.osm('Mountain View, CA') # <-- This request should be cached and return immediately.
print(g.latlng)
# [37.3893889, -122.0832101]

requests_cache.uninstall_cache()

Just for debugging purposes, you could check if requests are indeed cached:

import geocoder
import requests_cache

def debug(response):
    print(type(response))
    return True

requests_cache.install_cache('geocoder_cache2',  filter_fn=debug)

g = geocoder.osm('Mountain View, CA')
# <class 'requests.models.Response'>
# <class 'requests.models.Response'>

g = geocoder.osm('Mountain View, CA')
# <class 'requests_cache.models.response.CachedResponse'>

requests_cache.uninstall_cache()
Answered By: Eric Duminil