Can we read the data in a pickle file with Python generators?

Question:

I have a large pickle file and I want to load its data to train a deep learning model. Is there a way to use a generator to load the data one key at a time? The data is stored as a dictionary in the pickle file. I am using pickle.load(filename), but I am afraid that it will occupy too much RAM while the model is running. I originally used pickle.HIGHEST_PROTOCOL to dump the data to the pickle file.

Asked By: Sree


Answers:

Nope. The pickle file format isn’t like JSON or something else where you can just read part of it and decode it incrementally. A pickle file is a list of instructions for building a Python object, and just like following half the instructions to bake a cake won’t bake half a cake, reading half a pickle won’t give you half the pickled object.
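To make the "list of instructions" point concrete, here is a minimal sketch using the standard-library pickletools module. The sample dictionary and the helper name load_error are made up for illustration. It lists the opcodes of a pickled dict and shows that loading only the first half of the bytes fails instead of returning part of the dict.

```python
import pickle
import pickletools

# Hypothetical sample data, pickled the same way as in the question.
data = pickle.dumps({"a": [1, 2, 3], "b": "text"}, protocol=pickle.HIGHEST_PROTOCOL)

# A pickle is a stream of opcodes for a small stack machine; collect their names.
opcode_names = [opcode.name for opcode, _arg, _pos in pickletools.genops(data)]
print(opcode_names)  # the last opcode is always STOP


def load_error(blob: bytes):
    """Return the exception raised by pickle.loads(blob), or None on success."""
    try:
        pickle.loads(blob)
    except Exception as exc:  # typically pickle.UnpicklingError or EOFError
        return exc
    return None


truncated = data[: len(data) // 2]   # "half the instructions"
print(load_error(data))              # the complete pickle loads without error
print(repr(load_error(truncated)))   # the truncated pickle raises instead
```

The object only exists once the final STOP opcode has been reached, which is why a single pickled dict cannot be consumed key by key.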

Answered By: user2357112

(I will consider removing this answer if it is unhelpful, unrelated, or only indirectly related to this question for the SO community.)

The simple answer, that it is impossible, has already been given. But you can still solve this problem with one of the following alternatives, since your goal is to hold only a small amount of data in limited memory at any moment while reading known file-based data:

  • Break the dict down into smaller dicts and pickle each of them again. Then load the smaller pickle files one by one.

    • Pro: Low effort to implement
    • Con: You have to manage the loading order of the files
  • Create an intermediate storage and load data on demand. This can be done by splitting the pickled object into a keys-only pickle object and moving the dict contents into the intermediate storage. You will need additional code to load a value from the storage by key on demand.

    • Pro: No loading order issue
    • Con: Some effort to implement
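The first alternative in the list above could be sketched roughly as follows. This is only an illustration: the function names split_into_pickles and iter_chunks, the chunk_NNNNN.pickle file naming, and the chunk size are all assumptions, and the one-time split still needs the full dict in memory once (or has to happen when the data is first produced).

```python
import pickle
import tempfile
from itertools import islice
from pathlib import Path
from typing import Any, Dict, Iterator, List


def split_into_pickles(dct: Dict[str, Any], out_dir: Path, chunk_size: int) -> List[Path]:
    """Write dct as several small pickle files with at most chunk_size keys each."""
    items = iter(dct.items())
    paths: List[Path] = []
    while True:
        chunk = dict(islice(items, chunk_size))  # take the next chunk_size pairs
        if not chunk:
            break
        path = out_dir / f"chunk_{len(paths):05d}.pickle"  # hypothetical naming scheme
        with open(path, "wb") as f:
            pickle.dump(chunk, f, pickle.HIGHEST_PROTOCOL)
        paths.append(path)
    return paths


def iter_chunks(paths: List[Path]) -> Iterator[Dict[str, Any]]:
    """Yield one small dict at a time, so only one chunk is loaded at once."""
    for path in paths:
        with open(path, "rb") as f:
            yield pickle.load(f)


# Small demonstration with made-up data in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    big = {f"key{i}": i * i for i in range(7)}
    chunk_paths = split_into_pickles(big, Path(tmp), chunk_size=3)
    for chunk in iter_chunks(chunk_paths):
        print(chunk)
```

The returned list of paths fixes the loading order, which is the bookkeeping this alternative requires.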
+------------------------+        +------+
| Original Pickle Object | -----> | Dict |
+------------------------+        +------+
             |                        |
          +-----+                +---------+
          | Key |                | storage | (K/V pair)
          +-----+                +---------+
             ↓                        ↓
   +-------------------+       +-------------+
   | Your Data Trainer | <---- | Data Loader |
   +-------------------+       +-------------+
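One possible way to realize the second alternative, pictured in the diagram above, is the standard-library shelve module, which stores pickled values in a key/value database on disk. The answer does not name a specific storage, so shelve is an assumption here, as are the helper names build_store and load_by_key. Note that shelve requires string keys.

```python
import shelve
import tempfile
from pathlib import Path
from typing import Any, Dict, Iterator, List, Tuple


def build_store(dct: Dict[str, Any], store_path: str) -> List[str]:
    """Copy every key/value pair into a shelve database and return the keys."""
    with shelve.open(store_path, flag="n") as db:  # "n": always create a new, empty store
        for key, value in dct.items():
            db[key] = value  # each value is pickled separately under its key
    return list(dct)


def load_by_key(store_path: str, keys: List[str]) -> Iterator[Tuple[str, Any]]:
    """Data loader: fetch values from the store on demand, one key at a time."""
    with shelve.open(store_path, flag="r") as db:  # "r": open read-only
        for key in keys:
            yield key, db[key]


# Small demonstration with made-up data in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    store = str(Path(tmp) / "store")
    keys = build_store({"item1": 1, "item2": 2, "foo": "bar"}, store)
    for key, value in load_by_key(store, keys):
        print(key, value)
```

Because values are looked up by key, the trainer can request them in any order, for example shuffled per epoch, without loading the whole dict.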
Answered By: Kir Chou

What you could do is iterate over the key-value pairs when saving the dict and dump each pair to the pickle file separately. That way you can later yield them one by one from the file, so only one tuple (key-value pair) has to be in memory at a time. To access particular entries, you can filter them as I do in the filter function. You can also implement fancier filters there using regex.

import os
import pickle
import re
from pathlib import Path
from typing import Any, Dict, Iterable, Iterator, Tuple

KeyVal = Tuple[str, Any]


def init_dict() -> Dict[str, Any]:
    dct = {
        'item1': 1,
        'item2': 2,
        'item3': True,
        'foo': 'bar',
    }
    return dct


def save_pickle(path: Path, dct: Dict[str, Any]) -> None:
    # Dump every (key, value) tuple as its own pickle, one after another.
    with open(path, 'wb') as f:
        for keyval in dct.items():
            pickle.dump(keyval, f, pickle.HIGHEST_PROTOCOL)


def load_pickle(path: Path) -> Iterator[KeyVal]:
    # Each pickle.load() call reads exactly one tuple; EOFError marks the end of the file.
    with open(path, 'rb') as f:
        while True:
            try:
                keyval = pickle.load(f)
            except EOFError:
                break
            yield keyval


def filter_substring(keyvals: Iterable[KeyVal], pattern: str) -> Iterator[KeyVal]:
    # Keep only the pairs whose key contains the given substring.
    for kv in keyvals:
        if pattern in kv[0]:
            yield kv


def filter_regex(keyvals: Iterable[KeyVal],
                 pattern: str = 'item'
                 ) -> Iterator[KeyVal]:
    # Keep only the pairs whose key matches the regular expression.
    pat_comp = re.compile(pattern)
    return (kv for kv in keyvals if pat_comp.search(kv[0]))


if __name__ == '__main__':
    filename = 'myfile.pickle'
    path = Path(os.getcwd(), filename)  # current working directory
    dct = init_dict()
    save_pickle(path, dct)
    keyvals = load_pickle(path)  # keyvals is a generator
    keyvals = filter_substring(keyvals, 'item')
    # keyvals = filter_regex(keyvals, 'item')  # alternative
    for kv in keyvals:
        print(kv)
        print(kv[1])

Program output:

(myenv) ~\Documents\python_programs>python stackoverflow_pickle_gen.py
('item1', 1)
1
('item2', 2)
2
('item3', True)
True
Answered By: ilja