Efficiently processing a ~50 million record file in Python

Question:

We have a file with about 46 million records in CSV format. Each record has about 18 fields and one of them is a 64 byte ID. We have another file with about 167K unique IDs in it. The records corresponding to those IDs need to be yanked out. So we have written a Python program that reads the 167K IDs into an array and processes the 46 million record file, checking whether the ID exists in each of those records. Here is a snippet of the code:

import csv
...
csvReadHandler = csv.reader(inputFile, delimiter=chr(1))
csvWriteHandler = csv.writer(outputFile, delimiter=chr(1), lineterminator='\n')
for fieldAry in csvReadHandler:
    lineCounts['orig'] += 1
    if fieldAry[CUSTOMER_ID] not in idArray:
        csvWriteHandler.writerow(fieldAry)
        lineCounts['mod'] += 1

We tested the program on a small set of data; here are the processing times:

lines: 117929 process time: 236.388447046 sec
lines: 145390 process time: 277.075321913 sec

We started running the program on the 46 million record file (which is about 13 GB in size) last night around 3:00 am EST; now it is around 10 am EST and it is still processing!

Questions:

  1. Is there a better way to process these records to improve the processing time?
  2. Is Python the right choice? Would awk or some other tool help?
  3. I am guessing that the 64 byte ID look-up against the 167K-entry array in the following statement is the culprit:
    if fieldAry[CUSTOMER_ID] not in idArray:

Is there a better alternative?

Thank you!

Update: this is being processed on an EC2 instance with an EBS-attached volume.

Asked By: S.A


Answers:

The easiest way to speed things up a bit would be to parallelize the line processing with some distributed solution. One of the easiest would be to use multiprocessing.Pool. You could do something like this (syntax not checked):

from multiprocessing import Pool

p = Pool(processes=4)
p.map(process_row, csvReadHandler)
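
A fuller, hypothetical sketch of that Pool idea is below; the file names, the CUSTOMER_ID index and the process_row worker are assumptions, the ID set is handed to each worker through the Pool initializer so it is built only once per process, and imap with a chunksize avoids materializing all 46 million rows at once:

from multiprocessing import Pool
import csv

id_set = set()  # populated in each worker by init_worker

def init_worker(ids):
    # Give every worker process its own reference to the ID set.
    global id_set
    id_set = ids

def process_row(row):
    # Hypothetical worker: keep the row only if its ID is not in the set.
    CUSTOMER_ID = 0  # assumed column index
    return row if row[CUSTOMER_ID] not in id_set else None

if __name__ == '__main__':
    with open('ids.txt') as f:  # assumed file name for the 167K IDs
        ids = {line.strip() for line in f}
    with open('input.csv') as inp, open('output.csv', 'w', newline='') as out:
        reader = csv.reader(inp, delimiter=chr(1))
        writer = csv.writer(out, delimiter=chr(1), lineterminator='\n')
        with Pool(processes=4, initializer=init_worker, initargs=(ids,)) as pool:
            for row in pool.imap(process_row, reader, chunksize=10000):
                if row is not None:
                    writer.writerow(row)

Note that every surviving row still has to come back to the parent process to be written, so for an I/O-bound job the gain from the pool may be modest.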

That said, Python is not the best language for this kind of batch processing (mainly because writing to disk is quite slow). It is better to leave all the disk-write management (buffering, queueing, etc.) to the Linux kernel, so a bash solution would work better. The most efficient way would be to split the input file into chunks and simply do an inverse grep to filter out the IDs:

for file in $list_of_splitted_files; do
  grep -vE '(id1|id2|id3|...)' "$file" > "$file.out"
done

If you need to merge afterwards, simply:

for file in *.out; do
  cat "$file" >> final_results.csv
done

Considerations:

  • Don’t know if doing a single grep with all the ids is more/less
    efficient than looping through all the ids and doing a single-id
    grep.
  • When writing parallel solutions, in any language, try to read from and
    write to different files to minimize the I/O bottleneck of all threads
    trying to write to the same file.
  • Put timers on your code for each processing part so you can see which part is wasting the most time (see the short sketch after this list). I really recommend this because I had a similar program to write and I thought the processing part was the bottleneck (similar to your comparison against the IDs vector), but in reality it was the I/O that was dragging down the whole execution.
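
A minimal way to add such timers, using time.perf_counter and purely illustrative step names:

import time

t0 = time.perf_counter()
# ... read the ID file and build the lookup structure ...
t1 = time.perf_counter()
# ... loop over the big CSV, filter and write ...
t2 = time.perf_counter()
print('building lookup:       %.1f s' % (t1 - t0))
print('filtering and writing: %.1f s' % (t2 - t1))
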
Answered By: Pol Alvarez Vecino

You should use a set instead of a list; before the for loop do:

idArray = set(idArray)

csvReadHandler = csv.reader(inputFile, delimiter=chr(1))
csvWriteHandler = csv.writer(outputFile, delimiter=chr(1), lineterminator='\n')
for fieldAry in csvReadHandler:
    lineCounts['orig'] += 1
    if fieldAry[CUSTOMER_ID] not in idArray:
        csvWriteHandler.writerow(fieldAry)
        lineCounts['mod'] += 1

And see the unbelievable speed-up; you're wasting days' worth of unnecessary processing time just because you chose the wrong data structure.


The in operator has O(1) time complexity with a set, whereas it has O(n) time complexity with a list. This might sound like "not a big deal", but it actually is the bottleneck in your script, even though a set has somewhat higher constant factors hidden in that O. So your code is spending something like 30,000 times longer on that single in operation than necessary. If the optimal version needed 30 seconds, you now spend 10 days on that single line alone.

See the following test: I generate 1 million IDs and take 10000 aside into another list – to_remove. I then do a for loop like you do, doing the in operation for each record:

import random
import timeit

all_ids = [random.randint(1, 2**63) for i in range(1000000)]
to_remove = all_ids[:10000]
random.shuffle(to_remove)
random.shuffle(all_ids)


def test_set():
    to_remove_set = set(to_remove)
    for i in all_ids:
        if i in to_remove_set:
            pass

def test_list():
    for i in all_ids:
        if i in to_remove:
            pass


print('starting')
print('testing list', timeit.timeit(test_list, number=1))
print('testing set', timeit.timeit(test_set, number=1))

And the results:

testing list 227.91903045598883
testing set 0.14897623099386692

It took 149 milliseconds for the set version; the list version required 228 seconds. Now these were small numbers: in your case you have 50 million input records against my 1 million, so you need to multiply the set test time by 50: with your data set it would take about 7.5 seconds.

The list version, on the other hand, needs to be multiplied by 50 * 17: not only are there 50 times more input records, but about 17 times more IDs to match against (your 167K vs my 10K). Thus we get 227 * 50 * 17 = 192,950 seconds, which is about 2.2 days.

So your algorithm spends 2.2 days doing something that, with the correct data structure, can be done in about 7.5 seconds. Of course this does not mean that you can scan the whole 13 GB file in 7.5 seconds, but it probably isn't more than the ~2 days of reading and writing either. So we changed from:

             2 days                           2.2 days 
 |reading and writing the files||------- doing id in list ------|

to something like

             2 days            7.5 seconds (doing id in set)
 |reading and writing the files||

Answered By: Antti Haapala

Let's start off by defining the bottlenecks you'll encounter (some obvious, some not).

  • Hard drives – they're slow and don't cache large quantities
  • Re-reading the same data multiple times
  • Memory – you can't store a 13 GB file in it, or maybe you can and that's an option?

To solve these, you could go multiple routes.

  1. One clearly beneficial way would be to read the large data into a database (PostgreSQL or MariaDB, for instance). But I'll assume this isn't possible for now.

As for the CSV reader, it's fine but maybe not as efficient as it could be.
Since you're going to read through both files anyway, I would consider the following:

  1. Read the 13 GB file line by line and save each ID in a dictionary without checking if the key already exists (why? because checking whether the key exists is slower than just overwriting it, and dictionaries also have the added bonus that keys are unique, so duplicates will be weeded out), or add it to a set() as described by many others.

    Then read the smaller file line by line and check whether your dict or set contains the ID.

dict() vs set() vs list()

Here’s a comparison between the three data types set(), list() and dict():
code used: test.py

(11.719028949737549, 999950, 'Using dict() for inserts')
(1.8462610244750977, 'dict() time spent looking through the data')

(11.793760061264038, 999961, 'Using set() for inserts')
(7.019757986068726, 'set() time spent looking through the data')

(15min+, 'Using list()')  # Honestly, I never let it run to its finish.. it's too slow.

As you can see, dict is marginally quicker than set here, while list() just falls behind completely (see Antti's description of why). I should point out my data is a bit skewed since it's a quick and dirty demo of the speed difference, but the general idea should still come across.
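
The test.py referenced above isn't reproduced here, but a rough sketch of that kind of comparison (insert a big batch of keys, then time membership checks; the sizes and key format are made up, so expect different numbers) could look like this:

import random
import time

keys = [str(random.randint(1, 2**63)) for _ in range(1000000)]
probes = random.sample(keys, 100000)

start = time.time()
d = {}
for k in keys:
    d[k] = None  # overwrite blindly; duplicates disappear
insert_dict = time.time() - start
start = time.time()
hits = sum(1 for p in probes if p in d)
lookup_dict = time.time() - start

start = time.time()
s = set()
for k in keys:
    s.add(k)
insert_set = time.time() - start
start = time.time()
hits = sum(1 for p in probes if p in s)
lookup_set = time.time() - start

print((insert_dict, len(d), 'Using dict() for inserts'))
print((lookup_dict, 'dict() time spent looking through the data'))
print((insert_set, len(s), 'Using set() for inserts'))
print((lookup_set, 'set() time spent looking through the data'))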

Using dict() as a solution

So if you do not have access to a database version of the source data, and you need to use Python, I'd go with something along the lines of:

delimiter = b'\x01'
big_source = {}
with open('13_gig_source.log', 'rb') as fh:
    for line in fh:
        csv = line.split(delimiter)
        big_source[csv[4]] = None  # 5th column, change to match CUSTOMER_ID

output = open('result.log', 'wb')
with open('smaller_file.log', 'rb') as fh:
    for line in fh:
        csv = line.split(delimiter)
        if csv[4] in big_source:
            output.write(csv[4] + b'\n')
output.close()

Since I didn't know which column the data exists in, I didn't optimize the split().
For instance, if it's the last column you're trying to fetch, do line.rsplit(delimiter, 1)[-1] instead. Or if it's the 3rd column, do line.split(delimiter, 3)[2], because split() will then stop looking for further delimiter positions once it has done the requested number of splits.
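
For example (the column values here are made up, just to show the two calls):

delimiter = b'\x01'
line = b'a\x01b\x01c\x01d\x01e\n'

last_col = line.rsplit(delimiter, 1)[-1].rstrip(b'\n')  # b'e', only one split, from the right
third_col = line.split(delimiter, 3)[2]                 # b'c', stops splitting early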

Using linux tools

Yes, some tools such as awk might be better suited for this, since awk is a specific tool written in C to perform a very specific task. Even though Python is implemented in C, it still has a lot of abstraction layers on top of that C code and will for the most part be slower than the counterpart C tools written for specific tasks.

Now I have no data to test this with, nor am I a pro with nix commands (or PWN-C for short), so I'll leave it to someone else to give you examples of this.

Answered By: Torxed

I think it's better to use a database to solve this problem.
First create a database like MySQL or anything else,
then write your data from the two files into 2 tables,
and finally use a simple SQL query to select the rows,
something like:
select * from table1 where id in (select ids from table2)
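
A quick sketch of that idea using Python's built-in sqlite3 module instead of MySQL (the file names, delimiter and column index are assumptions carried over from the question) might look like:

import csv
import sqlite3

CUSTOMER_ID = 0  # assumed column index

con = sqlite3.connect('lookup.db')  # file-backed, since the data won't fit in memory
con.execute('CREATE TABLE IF NOT EXISTS records (customer_id TEXT, line TEXT)')
con.execute('CREATE TABLE IF NOT EXISTS ids (customer_id TEXT PRIMARY KEY)')

with open('records.csv') as fh:
    reader = csv.reader(fh, delimiter=chr(1))
    con.executemany('INSERT INTO records VALUES (?, ?)',
                    ((row[CUSTOMER_ID], chr(1).join(row)) for row in reader))

with open('ids.txt') as fh:
    con.executemany('INSERT OR IGNORE INTO ids VALUES (?)',
                    ((line.strip(),) for line in fh))
con.commit()

# The "rows whose id is in the other table" query from above:
for (line,) in con.execute(
        'SELECT line FROM records '
        'WHERE customer_id IN (SELECT customer_id FROM ids)'):
    print(line)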

Answered By: saeed safavi