How to insert billions of key-value pairs into Redis efficiently?

Question:

I have around 2 billion key-value pairs and I want to load them into Redis efficiently. I am currently using Python with a pipeline, as documented by redis-py. How can I speed up the following approach?

import redis

def load(pdt_dict):
    """
    Load data into redis.

    Parameters
    ----------
    pdt_dict : Dict[str, str]
        To be stored in Redis
    """
    redIs = redis.Redis()
    pipe = redIs.pipeline()
    for key in pdt_dict.keys():
        pipe.hmset(self.seller + ":" + str(key), pdt_dict[key])
    pipe.execute()
Asked By: John Deep


Answers:

A few points regarding the question and sample code.

  1. Pipelining isn’t a silver bullet – you need to understand what it does before you use it. Pipelining batches several operations so they are sent in bulk, as is their response from the server. What you gain is that the network round-trip time of each operation is replaced by that of the batch. But infinitely-sized batches are a real drain on resources – you need to keep them small enough to be effective. As a rule of thumb, I usually aim for 60KB per pipeline, and since every payload is different, so is the number of actual operations in a pipeline. Assuming that your key and its value are ~1KB, you need to call pipeline.execute() every 60 operations or so (see the sketch after this list).

  2. Unless I grossly misunderstand, this code shouldn’t run. You’re using HMSET as if it were SET, so you’re basically missing the field->value mapping that Hashes require. Hashes (HMSET) and Strings (SET) are different data types and should therefore be used accordingly.

  3. It appears as if this one little loop is in charge of the entire 2 billion pairs – if that is the case, not only would the server running the code be swapping like crazy unless it has a lot of RAM to hold the dictionary, it would also be very ineffective (regardless of Python’s speed). You need to parallelize the data insertion by running multiple instances of this process.

  4. Are you connecting to Redis remotely? If so, the network may be limiting your performance.

  5. Consider your Redis server’s configuration – perhaps it can be tweaked/tuned for better performance on this task, assuming it is indeed the bottleneck.
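
A minimal sketch combining points 1 and 2 – chunked pipeline execution with a proper field->value mapping for HMSET. The 60-command chunk size follows the rule of thumb above, and the seller parameter stands in for the question’s self.seller:

import redis

def load(pdt_dict, seller, chunk_size=60):
    """Store each value as a Redis hash, flushing the pipeline in chunks.

    pdt_dict is assumed to map each key to a field->value dict, which is
    what HMSET actually expects. Note that newer redis-py versions
    deprecate hmset in favour of hset(name, mapping=...).
    """
    r = redis.Redis()
    pipe = r.pipeline()
    for i, (key, mapping) in enumerate(pdt_dict.items(), start=1):
        pipe.hmset(seller + ":" + str(key), mapping)
        if i % chunk_size == 0:
            pipe.execute()  # one network round trip per chunk
    pipe.execute()  # flush the final partial chunk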

Answered By: Itamar Haber

I hope you’ve installed the hiredis Python package alongside the redis package. See https://github.com/andymccurdy/redis-py#parsers – it should give you a performance boost too.
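
A quick way to check which parser redis-py picked (a sketch; DefaultParser is the attribute used by the classic andymccurdy redis-py, and it resolves to the C-backed parser only when hiredis is installed):

import redis.connection

# Prints HiredisParser when the hiredis package is available,
# PythonParser otherwise.
print(redis.connection.DefaultParser)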

What does self.seller do? Maybe it is a bottleneck?

As @Itamar said, execute the pipeline periodically:

import redis

def load(pdt_dict):
    redIs = redis.Redis()
    pipe = redIs.pipeline()
    n = 0
    for key in pdt_dict.keys():
        # self.seller is carried over from the question's code
        pipe.hmset(self.seller + ":" + str(key), pdt_dict[key])
        n += 1
        if n % 64 == 0:
            pipe.execute()  # execute() also resets the pipeline, so no need to recreate it
    pipe.execute()  # flush the last partial batch
Answered By: Markus

To feed large volumes of data into Redis, consider using the Redis mass insertion feature described in the Redis docs (https://redis.io/topics/mass-insert).

For this to work you’ll need to have access to redis-cli.

Answered By: Zwitterion

You could use redis-cli in pipe mode.

  1. First prepare a file of commands like the following (note that the lines should be terminated by CR/LF, or set a custom delimiter with the -d <delimiter> option):
    SET Key0 Value0
    SET Key1 Value1
    ...
    SET KeyN ValueN

  2. Then serialize it, converting it to the Redis RESP protocol format (e.g. as a quoted string; see the docs).

  3. Finally, pipe it to redis-cli (with the --pipe argument):

cat data_in_resp_format.txt | redis-cli --pipe
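
A minimal sketch of the serialization step in Python (the file name and the three sample pairs are placeholders; the encoding follows the protocol description in the mass-insertion docs):

def resp_command(*args):
    # RESP encoding: "*<number of args>", then "$<byte length>" followed
    # by the payload for each argument, all CRLF-terminated.
    parts = ["*%d\r\n" % len(args)]
    for arg in args:
        arg = str(arg)
        parts.append("$%d\r\n%s\r\n" % (len(arg.encode("utf-8")), arg))
    return "".join(parts)

with open("data_in_resp_format.txt", "w", newline="") as f:
    # newline="" keeps the explicit \r\n terminators intact on all platforms
    for i in range(3):
        f.write(resp_command("SET", "Key%d" % i, "Value%d" % i))

The resulting file is exactly what the redis-cli --pipe command above expects.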
Answered By: stacker

Another consideration: passing transaction=False when constructing your pipeline can provide a performance increase if the following conditions apply (from Redis Labs):

For situations where we want to send more than one command to Redis,
the result of one command doesn’t affect the input to another, and we
don’t need them all to execute transactionally, passing False to the
pipeline() method can further improve overall Redis performance.
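
A minimal sketch of that option, using placeholder keys:

import redis

r = redis.Redis()
# transaction=False drops the MULTI/EXEC wrapper: commands are still sent
# as one batch, but they are no longer executed atomically.
pipe = r.pipeline(transaction=False)
for i in range(60):
    pipe.set("key:%d" % i, "value:%d" % i)  # placeholder data
pipe.execute()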

Answered By: Brendan