Python in-memory files for caching large files

Question:

I am processing a very large file (16 GB) and would like to speed up the operations by storing the entire file in RAM, to avoid disk latency.

I looked into the existing libraries but couldn’t find anything with the interface flexibility I need.

Ideally, I would like something with a built-in readline() method, so the behavior matches the standard file-reading interface.

Asked By: AdmiralFishHead


Answers:

Just read the file once. The OS will cache it for the next read operations. It’s called the page cache.

For example, run this code and see what happens (replace the path with one pointing to a file on your system):

import time

filepath = '/home/mostafa/Videos/00053.MTS'
n_bytes = 200_000_000  # read 200 MB

t0 = time.perf_counter()
with open(filepath, 'rb') as f:
    f.read(n_bytes)
t1 = time.perf_counter()
with open(filepath, 'rb') as f:
    f.read(n_bytes)
t2 = time.perf_counter()

print(f'duration 1 = {t1 - t0}')
print(f'duration 2 = {t2 - t1}')

On my Linux system, I get:

duration 1 = 2.1419905399670824
duration 2 = 0.10992361599346623

Note how much faster the second read operation is, even though the file was closed and reopened in between. This is because the second read is served from RAM, thanks to the Linux kernel's page cache. Also note that if you run this program again, you'll see both operations finish quickly, because the file is still cached from the previous run:

duration 1 = 0.10143039299873635
duration 2 = 0.08972924604313448

So in short, just use this code to preload the file into RAM:

with open(filepath, 'rb') as f:
    f.read()
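If you specifically want an in-memory file object with the standard readline() interface (as the question asks), the stdlib io.BytesIO can wrap the bytes you just read. A minimal sketch, using a small hypothetical sample file in place of the real 16 GB one:

```python
import io

# Create a small demo file (stand-in for the real large file).
path = 'bytesio_demo.txt'  # hypothetical path
with open(path, 'wb') as f:
    f.write(b'alpha\nbeta\n')

# Load the whole file into RAM once...
with open(path, 'rb') as f:
    data = f.read()

# ...then wrap it in an in-memory file object with the usual
# readline()/read()/seek() interface.
buf = io.BytesIO(data)
first = buf.readline()   # b'alpha\n'
second = buf.readline()  # b'beta\n'
```

All subsequent reads on `buf` hit RAM only, regardless of what the OS page cache does.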

Finally, it’s worth mentioning that the page cache is effectively non-deterministic: your file might not stay fully cached, especially if its size is large compared to the available RAM. But in cases like that, caching the whole file might not be practical anyway. In short, the default OS page cache is not the most sophisticated mechanism, but it is good enough for many use cases.

Answered By: Mostafa Farzán

Homer512 partly answered my question: mmap is a nice option that also results in OS caching.
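A minimal sketch of the mmap approach (using a small hypothetical demo file): the mapping is paged into RAM by the OS on first access, and mmap objects support readline() directly, matching the file-like interface the question asks for.

```python
import mmap

# Create a small demo file (stand-in for the real large file).
path = 'mmap_demo.txt'  # hypothetical path
with open(path, 'wb') as f:
    f.write(b'first line\nsecond line\n')

# Memory-map the file read-only and iterate over its lines.
# Pages are loaded lazily and cached by the OS.
with open(path, 'rb') as f:
    with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
        lines = []
        line = mm.readline()  # mmap objects support readline()
        while line:
            lines.append(line)
            line = mm.readline()
```

Unlike read(), mmap does not copy the whole file up front, so it also works when the file is larger than RAM.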

Alternatively, NumPy has interfaces for parsing byte streams from a file directly into memory, and it also comes with some string-related operations.
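For example, numpy.fromfile parses a binary file straight into an in-memory array. A sketch, assuming a small hypothetical file of little-endian 32-bit integers:

```python
import numpy as np

# Create a small binary demo file of little-endian int32 values
# (stand-in for the real large file).
path = 'np_demo.bin'  # hypothetical path
np.array([1, 2, 3, 4], dtype='<i4').tofile(path)

# Parse the entire file into a NumPy array held in RAM.
arr = np.fromfile(path, dtype='<i4')
```

numpy.memmap offers a related lazily-loaded variant when the array is too large to hold in RAM at once.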

Answered By: AdmiralFishHead