Performance of Redis vs Disk in caching application

Question:

I wanted to create a redis cache in python, and like any self-respecting scientist I made a benchmark to test its performance.

Interestingly, redis did not fare so well. Either Python is doing something magic (storing the file) or my version of redis is stupendously slow.

I don’t know if this is because of the way my code is structured, or what, but I was expecting redis to do better than it did.

To make a redis cache, I set my binary data (in this case, an HTML page) to a key derived from the filename with an expiration of 5 minutes.

In all cases, file handling is done with f.read() (this is ~3x faster than f.readlines(), and I need the binary blob).

Is there something I’m missing in my comparison, or is Redis really no match for a disk? Is Python caching the file somewhere, and reaccessing it every time? Why is this so much faster than access to redis?

I’m using redis 2.8, python 2.7, and redis-py, all on a 64-bit Ubuntu system.

I do not think Python is doing anything particularly magical, as I made a function that stored the file data in a python object and yielded it forever.

I have four function calls that I grouped:

  • reading the file X times
  • a function that checks whether the redis object is still in memory, loads it, or caches a new file (single and multiple redis instances)
  • a function that creates a generator yielding the result from the redis database (single and multiple redis instances)
  • and finally, storing the file in memory and yielding it forever

import redis
import time

def load_file(fp, fpKey, r, expiry):
    with open(fp, "rb") as f:
        data = f.read()
    p = r.pipeline()
    p.set(fpKey, data)
    p.expire(fpKey, expiry)
    p.execute()
    return data

def cache_or_get_gen(fp, expiry=300, r=redis.Redis(db=5)):
    # NB: a default argument is evaluated once, at definition time,
    # so this client is shared by every call that relies on the default
    fpKey = "cached:"+fp

    while True:
        yield load_file(fp, fpKey, r, expiry)
        t = time.time()
        while time.time() - t - expiry < 0:
            yield r.get(fpKey)


def cache_or_get(fp, expiry=300, r=redis.Redis(db=5)):
    # the default client here is likewise created once and reused
    fpKey = "cached:"+fp

    if r.exists(fpKey):
        return r.get(fpKey)

    else:
        with open(fp, "rb") as f:
            data = f.read()
        p = r.pipeline()
        p.set(fpKey, data)
        p.expire(fpKey, expiry)
        p.execute()
        return data

def mem_cache(fp):
    with open(fp, "rb") as f:
        data = f.read()
    while True:
        yield data

def stressTest(fp, trials = 10000):

    # Read the file x number of times
    a = time.time()
    for x in range(trials):
        with open(fp, "rb") as f:
            data = f.read()
    b = time.time()
    readAvg = trials/(b-a)


    # Generator version

    # Read the file, cache it, read it with a new instance each time
    a = time.time()
    gen = cache_or_get_gen(fp)
    for x in range(trials):
        data = next(gen)
    b = time.time()
    cachedAvgGen = trials/(b-a)

    # Read file, cache it, pass in redis instance each time
    a = time.time()
    r = redis.Redis(db=6)
    gen = cache_or_get_gen(fp, r=r)
    for x in range(trials):
        data = next(gen)
    b = time.time()
    inCachedAvgGen = trials/(b-a)


    # Non generator version    

    # Read the file, cache it, read it with a new instance each time
    a = time.time()
    for x in range(trials):
        data = cache_or_get(fp)
    b = time.time()
    cachedAvg = trials/(b-a)

    # Read file, cache it, pass in redis instance each time
    a = time.time()
    r = redis.Redis(db=6)
    for x in range(trials):
        data = cache_or_get(fp, r=r)
    b = time.time()
    inCachedAvg = trials/(b-a)

    # Read file, cache it in python object
    a = time.time()
    for x in range(trials):
        data = mem_cache(fp)
    b = time.time()
    memCachedAvg = trials/(b-a)


    print "\n%s file reads: %.2f reads/second\n" %(trials, readAvg)
    print "Yielding from generators for data:"
    print "multi redis instance: %.2f reads/second (%.2f percent)" %(cachedAvgGen, (100*(cachedAvgGen-readAvg)/(readAvg)))
    print "single redis instance: %.2f reads/second (%.2f percent)" %(inCachedAvgGen, (100*(inCachedAvgGen-readAvg)/(readAvg)))
    print "Function calls to get data:"
    print "multi redis instance: %.2f reads/second (%.2f percent)" %(cachedAvg, (100*(cachedAvg-readAvg)/(readAvg)))
    print "single redis instance: %.2f reads/second (%.2f percent)" %(inCachedAvg, (100*(inCachedAvg-readAvg)/(readAvg)))
    print "python cached object: %.2f reads/second (%.2f percent)" %(memCachedAvg, (100*(memCachedAvg-readAvg)/(readAvg)))

if __name__ == "__main__":
    fileToRead = "templates/index.html"

    stressTest(fileToRead)

And now the results:

10000 file reads: 30971.94 reads/second

Yielding from generators for data:
multi redis instance: 8489.28 reads/second (-72.59 percent)
single redis instance: 8801.73 reads/second (-71.58 percent)
Function calls to get data:
multi redis instance: 5396.81 reads/second (-82.58 percent)
single redis instance: 5419.19 reads/second (-82.50 percent)
python cached object: 1522765.03 reads/second (4816.60 percent)

The results are interesting in that a) generators are faster than calling functions each time, b) redis is slower than reading from the disk, and c) reading from python objects is ridiculously fast.

Why would reading from a disk be so much faster than reading from an in-memory file from redis?

EDIT:
Some more information and tests.

I replaced the existence check in the function with a single GET:

data = r.get(fpKey)
if data:
    return data

instead of the original

if r.exists(fpKey):
    data = r.get(fpKey)


Function calls to get data using r.exists as test
multi redis instance: 5320.51 reads/second (-82.34 percent)
single redis instance: 5308.33 reads/second (-82.38 percent)
python cached object: 1494123.68 reads/second (5348.17 percent)


Function calls to get data using if data as test
multi redis instance: 8540.91 reads/second (-71.25 percent)
single redis instance: 7888.24 reads/second (-73.45 percent)
python cached object: 1520226.17 reads/second (5132.01 percent)

Creating a new redis instance on each function call actually has no noticeable effect on read speed; the variability from test to test is larger than the gain. (This is unsurprising in hindsight: a default argument such as r=redis.Redis(db=5) is evaluated only once, at definition time, so the "multi instance" tests were reusing one client anyway.)

Sripathi Krishnan suggested implementing random file reads. This is where caching starts to really help, as we can see from these results.

Total number of files: 700

10000 file reads: 274.28 reads/second

Yielding from generators for data:
multi redis instance: 15393.30 reads/second (5512.32 percent)
single redis instance: 13228.62 reads/second (4723.09 percent)
Function calls to get data:
multi redis instance: 11213.54 reads/second (3988.40 percent)
single redis instance: 14420.15 reads/second (5157.52 percent)
python cached object: 607649.98 reads/second (221446.26 percent)

There is a HUGE amount of variability in file reads so the percent difference is not a good indicator of speedup.

Total number of files: 700

40000 file reads: 1168.23 reads/second

Yielding from generators for data:
multi redis instance: 14900.80 reads/second (1175.50 percent)
single redis instance: 14318.28 reads/second (1125.64 percent)
Function calls to get data:
multi redis instance: 13563.36 reads/second (1061.02 percent)
single redis instance: 13486.05 reads/second (1054.40 percent)
python cached object: 587785.35 reads/second (50214.25 percent)

I used random.choice(fileList) to randomly select a new file on each pass through the functions.

The full gist is here if anyone would like to try it out – https://gist.github.com/3885957

Edit edit:
I did not realize that I was reading one single file in the generator tests (although the performance of the function calls and generators was very similar). Here are the results with a different file on each pass through the generator as well.

Total number of files: 700
10000 file reads: 284.48 reads/second

Yielding from generators for data:
single redis instance: 11627.56 reads/second (3987.36 percent)

Function calls to get data:
single redis instance: 14615.83 reads/second (5037.81 percent)

python cached object: 580285.56 reads/second (203884.21 percent)
Asked By: MercuryRising


Answers:

This is an apples to oranges comparison.
See http://redis.io/topics/benchmarks

Redis is an efficient remote data store. Each time a command is executed on Redis, a message is sent to the Redis server, and if the client is synchronous, it blocks waiting for the reply. So beyond the cost of the command itself, you will pay for a network roundtrip or an IPC.

On modern hardware, network roundtrips or IPCs are surprisingly expensive compared to other operations. This is due to several factors:

  • the raw latency of the medium (mainly for network)
  • the latency of the operating system scheduler (not guaranteed on Linux/Unix)
  • memory cache misses are expensive, and the probability of cache misses increases while the client and server processes are scheduled in/out.
  • on high-end boxes, NUMA side effects

Now, let’s review the results.

The generator-based implementation and the function-call implementation do not generate the same number of roundtrips to Redis. With the generator you simply have:

    while time.time() - t - expiry < 0:
        yield r.get(fpKey)

So 1 roundtrip per iteration. With the function, you have:

if r.exists(fpKey):
    return r.get(fpKey)

So 2 roundtrips per iteration. No wonder the generator is faster.
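The EXISTS roundtrip can be dropped entirely, since GET already returns nil on a miss. A sketch of a one-roundtrip-per-hit version (the function name is mine; r is any client with redis-py's get/pipeline interface):

```python
def cache_or_get_one_trip(fp, r, expiry=300):
    fpKey = "cached:" + fp
    data = r.get(fpKey)          # one roundtrip; None on a cache miss
    if data is not None:
        return data
    with open(fp, "rb") as f:    # miss: read from disk...
        data = f.read()
    p = r.pipeline()             # ...then SET + EXPIRE in one roundtrip
    p.set(fpKey, data)
    p.expire(fpKey, expiry)
    p.execute()
    return data
```

So a cache hit costs one roundtrip instead of two, which matches the speedup the question's "if data as test" edit observed.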

Of course, you are supposed to reuse the same Redis connection for optimal performance. There is no point in running a benchmark that systematically connects and disconnects.

Finally, regarding the performance difference between Redis calls and file reads: you are simply comparing a local call to a remote one. The file reads are cached by the OS filesystem (the page cache), so they are fast memory-transfer operations between the kernel and Python; no disk I/O is involved. With Redis, you also pay the cost of the roundtrips, so it is much slower.
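The page-cache effect is easy to reproduce: after the first read, every subsequent read of the same file is served from RAM. A minimal sketch (the helper name is mine):

```python
import timeit

def reads_per_second(path, trials=1000):
    # re-read one file `trials` times; after the first pass the OS
    # page cache serves the bytes from RAM, with no disk I/O at all
    def read_once():
        with open(path, "rb") as f:
            f.read()
    return trials / timeit.timeit(read_once, number=trials)
```

Running this on a small file typically reports tens of thousands of reads per second, in line with the question's 30971.94 figure; flushing the page cache (or reading 700 random files, as in the edit) is what brings real disk latency back into the picture.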

Answered By: Didier Spezia