How can I find the intersection of two large files efficiently using Python?

Question:

I have two large files. Their contents look like this:

134430513
125296589
151963957
125296589

Each file contains an unsorted list of ids. Some ids may appear more than once in a single file.

Now I want to find the intersection of the two files, that is, the ids that appear in both files.

I just read the two files into two sets, s1 and s2, and get the intersection with s1.intersection(s2). But it consumes a lot of memory and seems slow.

So is there a better or more Pythonic way to do this? And if a file contains so many ids that they cannot all be read into a set with limited memory, what can I do?

EDIT: I read the files into two sets using a generator:

def id_gen(path):
    for line in open(path):
        # each line holds one id; yield it as an integer
        yield int(line.split()[0])

c1 = id_gen(path)
s1 = set(c1)

All of the ids are numeric, and the maximum id may be 5000000000. If I use a bitarray, it will consume even more memory.

Asked By: amazingjxq


Answers:

set(open(file1)) & set(open(file2))

which is equivalent to using intersection, is the most Pythonic way. You might be able to speed it up by doing

set(int(x) for x in open(file1)) & set(int(x) for x in open(file2))

since then you’ll be storing and comparing integers rather than strings. This only works if all ids are numeric, of course.

If it’s still not fast enough, you can turn to a slightly more imperative style:

# heuristic: set smaller_file and larger_file by checking the file size
a = set(int(x) for x in open(smaller_file))
# note: we're storing strings in r
r = set(x for x in open(larger_file) if int(x) in a)

If both files are guaranteed not to contain duplicates, you can also use a list to speed things up:

a = set(int(x) for x in open(smaller_file))
r = [x for x in open(larger_file) if int(x) in a]

Be sure to measure the various solutions, and check whether you are actually CPU-bound rather than waiting on disk or network I/O.

Answered By: Fred Foo

You need not create both s1 and s2. First read the lines of the first file, convert each line to an integer (which saves memory), and put it in s1. Then, for each line in the second file, convert it to an integer and check whether that value is in s1.

That way, you save the memory spent on storing strings and on keeping two full sets.
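
A minimal sketch of that approach ('file1' and 'file2' are placeholder paths):

# only the first file is materialized as a set of integers
s1 = set(int(line) for line in open('file1'))

common = set()
for line in open('file2'):   # stream the second file line by line
    n = int(line)
    if n in s1:
        common.add(n)        # a set keeps duplicate matches from repeating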

Answered By: Nam Nguyen

The algorithm is not necessarily tied to Python; it is generic whenever you cannot represent all ids in an in-memory set. If the range of the integers is limited, one approach is a large bitarray: read the first file and set the bit for each integer it contains, then read the second file and output all numbers whose bit is set.

If even this is not sufficient, you can split the range over multiple sweeps. That is, in the first pass you only consider integers smaller than 0x200000000 (a 1 GB bitarray). Then you reset the bitarray and read the files again, considering only integers from 0x200000000 to 0x400000000 (and subtracting 0x200000000 before indexing).

This way you can handle very large amounts of data with a reasonable runtime.

A sample single sweep would be:

import bitarray

# one bit per possible id; 5,000,000,000 bits is about 600 MB
r = bitarray.bitarray(5000000000)
r.setall(False)  # bitarray() leaves the bits uninitialized

for line in open(file1):
    r[int(line)] = True

for line in open(file2):
    if r[int(line)]:
        print(int(line))
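
The multi-sweep variant described above might look like the following sketch (assuming, as in the question, that all ids stay below 5000000000):

import bitarray

CHUNK = 0x200000000        # 2**33 ids per sweep -> a 1 GB bitarray
MAX_ID = 5000000000

for base in range(0, MAX_ID, CHUNK):
    r = bitarray.bitarray(CHUNK)
    r.setall(False)
    for line in open(file1):
        n = int(line)
        if base <= n < base + CHUNK:
            r[n - base] = True   # mark ids from the first file in this range
    for line in open(file2):
        n = int(line)
        if base <= n < base + CHUNK and r[n - base]:
            print(n)
            r[n - base] = False  # clear the bit so duplicates print only once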
Answered By: rumpel

AFAIK there is no efficient way to do this with Python, especially if you are dealing with massive amounts of data.

I like rumpel's solution. But please note that bitarray is a C extension.

I would use shell commands to handle this. You can pre-process files to save time & space:

sort -u file1 > file1.sorted
sort -u file2 > file2.sorted

Then you can use diff to extract the common lines:

diff --changed-group-format='' --unchanged-group-format='%=' file1.sorted file2.sorted

Of course it is possible to combine everything into a single command, without creating intermediate files.

UPDATE

According to Can's recommendation, comm is the more appropriate command:

sort -u file1 > file1.sorted
sort -u file2 > file2.sorted
comm -12 file1.sorted file2.sorted
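
As mentioned above, this can also be combined into a single command. For instance, with bash process substitution (assuming a bash shell), no intermediate files are needed:

comm -12 <(sort -u file1) <(sort -u file2)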
Answered By: muhuk

Others have shown the more idiomatic ways of doing this in Python, but if the size of the data really is too big, you can use the system utilities to sort and eliminate duplicates, and then use the fact that a file object is an iterator that returns one line at a time, doing something like:

import os

os.system('sort -u -n s1.num > s1.ns')
os.system('sort -u -n s2.num > s2.ns')
i1 = open('s1.ns', 'r')
i2 = open('s2.ns', 'r')
try:
    # the files are sorted numerically (-n), so compare as integers
    d1 = int(next(i1))
    d2 = int(next(i2))
    while True:
        if d1 < d2:
            d1 = int(next(i1))
        elif d2 < d1:
            d2 = int(next(i2))
        else:
            print(d1)
            d1 = int(next(i1))
            d2 = int(next(i2))
except StopIteration:
    pass

This avoids having more than one line at a time (for each file) in memory, and the system sort should be faster than anything Python can do, since it is optimized for this one task.

Answered By: James Kanze

For data larger than memory, you can split each data file into 10 files, where all ids in a file share the same last digit.

So all ids in s1.txt that end with 0 will be saved in s1_0.txt.

Then use set() to find the intersection of s1_0.txt and s2_0.txt, of s1_1.txt and s2_1.txt, and so on.
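
A minimal sketch of this bucketing (file names follow the example above; split_by_last_digit is a hypothetical helper):

def split_by_last_digit(path, prefix):
    # write each id to one of 10 bucket files, keyed by its last digit
    outs = [open(f'{prefix}_{d}.txt', 'w') for d in range(10)]
    for line in open(path):
        id_ = line.strip()
        outs[int(id_[-1])].write(id_ + '\n')
    for out in outs:
        out.close()

split_by_last_digit('s1.txt', 's1')
split_by_last_digit('s2.txt', 's2')

# each bucket pair is roughly a tenth of the data, so it fits in memory
for d in range(10):
    common = set(open(f's1_{d}.txt')) & set(open(f's2_{d}.txt'))
    for id_ in sorted(common):
        print(id_.strip())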

Answered By: HYRY

I ran into the same problem: I had two files of around 1 GB and 1.5 GB, and when I tried to store their contents in a set I got an out-of-memory error.

So I divided the files into 100 small files based on the last two digits of each id: temp1_00 contains the ids from the first file whose last two digits are 00, temp2_00 the corresponding ids from the second file, and so on. Then I computed the intersection of each of the 100 file pairs using sets and summed the counts.

Further optimization:
The task I was working on only needed an approximate value, so I computed the intersection for temp1_00 and temp2_00 alone and multiplied the result by 100 to get an estimate. The error was around 0-5%, which I could afford for a dataset that large.
If you need a more accurate result, you can compute it for 10 file pairs and multiply the result by 10.
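
A sketch of that sampling estimate (bucket file names as above):

# intersect a single bucket pair and scale up by the number of buckets
sample = len(set(open('temp1_00')) & set(open('temp2_00')))
estimate = sample * 100   # roughly 0-5% error in the author's runs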

Stats:

  • Actual size (taking all 100 files): 82,595,165 (313 seconds)
  • Approximate size (taking 10 files): 82,574,530 (31 seconds)
  • Approximate size (taking 1 file only): 85,892,000 (5 seconds)
Answered By: vishal singh