How to remove duplicate lines of a huge file in Python

Question:

I have a text file of around 32 GB and need to check whether it contains any duplicate lines.
What is the best way to remove duplicate lines from a huge text file without reading it line by line?

Asked By: dinan


Answers:

You can make a set of the lines seen so far, and to save memory, store a hash of each line instead of the line itself (at the cost of a minuscule chance of false positive):

seen = set()
with open('src-text.txt', 'r') as fin, open('src-text-unique.txt', 'w') as fout:
    for line in fin:
        h = hash(line)
        if h not in seen:
            fout.write(line)
            seen.add(h)

Notes:

  • if you are worried about 64-bit hash collisions, you may use a larger hash, such as hashlib.md5() or hashlib.sha256(), instead.
  • if you don’t have enough memory even for the hashes of the lines, you may look at a Bloom filter instead, which uses a fixed amount of memory (at the expense of a higher false-positive rate).
  • as a third alternative, somewhat inspired by @dawg’s idea of "decorating the lines": if memory is really too tight for either all the hashes or a large-enough Bloom filter, you could split your file into n_parts temporary files according to hash(line) % n_parts, storing the original line number along with each line. Deduplicate each part separately, then merge the (already line-number-sorted) part files; see the sketch after this list. This avoids an n log n sort and is O(n) instead. Unlike the first two options (set of hashes or Bloom filter), however, this technique is not applicable to stream processing; the first two are, until the number of distinct items grows too large and leads either to a MemoryError (set of hashes) or to too high a false-positive rate (Bloom filter).
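
For the first note, the only change to the snippet above would be h = hashlib.sha256(line.encode()).digest() (plus import hashlib), at the cost of more memory per stored hash.

Here is a minimal sketch of the third alternative, using only the standard library (the file names and n_parts are made up for illustration, and each part is assumed small enough to deduplicate in memory):

import heapq
import os

n_parts = 64
part_names = [f'part_{i}.txt' for i in range(n_parts)]

# 1) Split: tag each line with its line number and route it to a part file
#    chosen by hash(line) % n_parts, so all copies of a line land in the same part.
parts = [open(name, 'w') for name in part_names]
with open('src-text.txt', 'r') as fin:
    for lineno, line in enumerate(fin):
        parts[hash(line) % n_parts].write(f'{lineno}\t{line}')
for f in parts:
    f.close()

# 2) Deduplicate each part in memory (each holds only ~1/n_parts of the lines),
#    keeping the first occurrence; the survivors stay sorted by line number
#    because that is the order in which they were written.
deduped_names = []
for name in part_names:
    deduped = name + '.dedup'
    seen_lines = set()
    with open(name, 'r') as f, open(deduped, 'w') as out:
        for rec in f:
            _, line = rec.split('\t', 1)
            if line not in seen_lines:
                seen_lines.add(line)
                out.write(rec)
    os.remove(name)
    deduped_names.append(deduped)

# 3) Merge the already-sorted part files by line number (no global n log n sort)
#    and strip the line-number decoration.
part_files = [open(name, 'r') for name in deduped_names]
with open('src-text-unique.txt', 'w') as fout:
    for rec in heapq.merge(*part_files, key=lambda r: int(r.split('\t', 1)[0])):
        fout.write(rec.split('\t', 1)[1])
for f in part_files:
    f.close()
for name in deduped_names:
    os.remove(name)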

Addendum: Short analysis of collision rates and memory size:

Let’s say that your 32GB file has lines of 60 chars on average, and 10% of the lines are duplicates (90% distinct lines).

That would lead to:

>>> n_distinct = int(0.9 * 32 * 1024**3 / 60)
>>> n_distinct
515396075

In other words, about 1/2 billion distinct lines. That’s the number of hashes we’d have to store in memory, and the n value in a BloomFilter analysis.

Size and collision probability of hashes:

Referring to the well-known formulas for estimating the probability of collision described in Birthday attack, and using the first two terms of the Taylor expansion 1 − e^(−x) ≈ x − x²/2, we get the probability of collision for a well-distributed hash of n bits and k distinct items as:

def prob_collision(n_bits, n_distinct):
    n, k = n_bits, n_distinct  # usual notation
    x = k**2 / 2**(n + 1)
    return x - x**2 / 2  # approx of 1 - exp(-x)

  • Using the built-in hash(), each hash is a 64-bit int: 8 bytes. So we’d need 3.8 GiB of memory. The probability of at least one collision (meaning: at least one line wrongly identified as already seen when in fact it was not) over the entire file is about 0.7%.
  • Using md5 (128 bits), you’d need twice the memory, 7.7 GiB, but the collision probability drops to 3.9×10⁻²².
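
As a quick sanity check, plugging the numbers above into prob_collision (figures are approximate):

p64 = prob_collision(64, 515_396_075)    # ≈ 0.007, the ~0.7% quoted above
p128 = prob_collision(128, 515_396_075)  # ≈ 3.9e-22 for a 128-bit hash such as md5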

Size and collision probability of Bloom Filter:

A Bloom filter with m bits, intended to store up to n items, should use the optimal number of hash functions that minimizes the false-positive rate: k = m/n × log(2). With that number of functions, the false-positive rate (FPR) is roughly (1 − e^(−kn/m))^k. The probability of at least one collision is (much) less than FPR × n (it is roughly the integral of the FPR as the number of stored items grows from 0 to n_distinct).

import numpy as np

def bloom_k(m, n):
    return int(np.ceil(m / n * np.log(2)))

def bloom_false_positive_rate(m, n, k=None):
    k = bloom_k(m, n) if k is None else k
    fpos = np.power(1 - np.exp(-k*n/m), k)
    return fpos

def bloom_prob_collision(m, n, k=None):
    # too lazy to figure out the integral, so
    # using this instead which is slow as molasses
    k = bloom_k(m, n) if k is None else k
    i = np.arange(0, n + 1)
    p_coll = np.sum(np.power(1 - np.exp(-k*i/m), k))
    return p_coll

  • Using the same amount of memory as the n_distinct 64-bit hashes above (3.8 GiB), we get m = 32_641_751_449 bits, k = 44 hash functions, and a probability of at least one collision that is much smaller: roughly 0.0001%.
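
For reference, here is how the figures in that bullet come out of the helpers above (m and n are the values estimated earlier; results are approximate):

m = 32_641_751_449                        # 3.8 GiB expressed in bits
n = 515_396_075                           # estimated number of distinct lines
k = bloom_k(m, n)                         # -> 44 hash functions
fpr = bloom_false_positive_rate(m, n, k)  # ≈ 6e-14 per membership test
# bloom_prob_collision(m, n, k) allocates arrays of n + 1 elements, so it is
# both slow and memory-hungry; it comes out around 1e-6, i.e. the ~0.0001% above.
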
Answered By: Pierre D

As stated in the comments, your issue is processing a file that is bigger than your computer’s memory. The deduplication itself is usually trivial.

The best way to handle this is likely a decorate-sort-undecorate ("DSU") approach:

  1. Create a TEMP file and Decorate each line with:

    a) A hash large enough to identify duplicates: the built-in hash() is fine, and SHA-256 is likely bulletproof;

    b) The line number.

  2. Sort based on the hash key, using one of the many algorithms for sorting files larger than memory (e.g., an external merge sort)

  3. Eliminate the now-adjacent duplicates

  4. Re-sort based on the line number

  5. Undecorate back to the new file.

This assumes that the hashes of all the lines do not fit in memory. If they DO fit in memory, use Pierre D’s approach.
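
A minimal sketch of these five steps, assuming GNU sort is available to do the external (disk-based) sorting; the file names are illustrative:

import hashlib
import os
import subprocess

sort_env = {**os.environ, 'LC_ALL': 'C'}   # byte-wise ordering for `sort`

# 1) Decorate: prefix each line with its SHA-256 digest and its (zero-padded) line number.
with open('src-text.txt', 'r') as fin, open('decorated.txt', 'w') as out:
    for lineno, line in enumerate(fin):
        digest = hashlib.sha256(line.encode('utf-8')).hexdigest()
        out.write(f'{digest}\t{lineno:012d}\t{line}')

# 2) Sort on the hash key; GNU sort spills to temporary files, so memory stays bounded.
subprocess.run(['sort', '-o', 'by_hash.txt', 'decorated.txt'], env=sort_env, check=True)

# 3) Eliminate the now-adjacent duplicates: keep only the first record of each run
#    of identical digests (the one with the smallest line number).
with open('by_hash.txt', 'r') as fin, open('deduped.txt', 'w') as out:
    prev = None
    for rec in fin:
        digest, rest = rec.split('\t', 1)
        if digest != prev:
            out.write(rest)          # drop the digest, keep line number + line
            prev = digest

# 4) Re-sort on the (zero-padded) line number to restore the original order.
subprocess.run(['sort', '-o', 'by_lineno.txt', 'deduped.txt'], env=sort_env, check=True)

# 5) Undecorate: strip the line number, leaving only the original lines.
with open('by_lineno.txt', 'r') as fin, open('src-text-unique.txt', 'w') as fout:
    for rec in fin:
        fout.write(rec.split('\t', 1)[1])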

Answered By: dawg

MemoryError fix for @Pierre D’s code

import re

seen = set()
pattern = re.compile(r'^\d+\s+')  # skip lines that start with a number followed by whitespace

with open('file.txt', 'r', encoding='utf-8') as fin, open('output.txt', 'w', encoding='utf-8') as fout:
    for line in fin:
        if not pattern.match(line):
            h = hash(line)
            if h not in seen:
                fout.write(line)
                seen.add(h)
Answered By: moooon