How to prevent filling up memory when hashing large files with xxhash?

Question:

I’m trying to calculate xxhash of video files using the following code:

def get_hash(file):
    with open(file, 'rb') as input_file:
        return xxhash.xxh3_64(input_file.read()).hexdigest()

Some of the files are larger than the amount of RAM on the machine. When hashing those files, the memory fills up, followed by swap filling up, at which point the process gets killed by the OS (I assume).
What is the correct way to handle these situations? Thank you!

Asked By: equinoxe5


Answers:

Instead of hashing the entire file in one go, read it in chunks and update the hash as you go. Once a chunk has been hashed, it can be discarded.

import xxhash
from functools import partial

def get_hash(file):
    CHUNK_SIZE = 2 ** 20  # 1 MiB per read; tune to whatever you have memory to handle
    with open(file, 'rb') as input_file:
        x = xxhash.xxh3_64()
        # iter() with a b'' sentinel keeps calling read() until it returns b'' at EOF
        for chunk in iter(partial(input_file.read, CHUNK_SIZE), b''):
            x.update(chunk)
        return x.hexdigest()
 
Answered By: chepner
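
As a quick usage sketch, the helper above is called like any other function; the file path below is just a placeholder, not something from the question:

def main():
    # Hypothetical usage; point this at an actual large file on your machine.
    digest = get_hash('/path/to/large_video.mkv')
    print(digest)  # xxh3_64 hexdigest() yields a 16-character hex string

if __name__ == '__main__':
    main()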

You can read the input file in small chunks, rather than reading the entire file into memory at once.

import xxhash


def read_in_chunks(file_object, chunk_size=4096):
    while True:
        data = file_object.read(chunk_size)
        if not data:
            break
        yield data


def get_hash(file):
    h = xxhash.xxh3_64()
    with open(file, 'rb') as input_file:
        for chunk in read_in_chunks(input_file):
            h.update(chunk)
    return h.hexdigest()
Answered By: Mustafa Walid
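
Both answers rely on the same incremental-hashing idea. As a minimal sketch combining them, here is a compact variant using an assignment expression (requires Python 3.8+); the 1 MiB default chunk size is an assumption you can tune:

import xxhash

def get_hash(file, chunk_size=1024 * 1024):
    """Hash a file incrementally so memory use stays bounded by chunk_size."""
    h = xxhash.xxh3_64()
    with open(file, 'rb') as input_file:
        # read() returns b'' at EOF, which is falsy and ends the loop
        while chunk := input_file.read(chunk_size):
            h.update(chunk)
    return h.hexdigest()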