How to read a large sample of images efficiently without overloading RAM?

Question:

While training a classification model I pass the input image samples as a NumPy array, but when I try to train on a large dataset I run into a memory error. I currently have 120 GB of RAM, and even with that much memory I still hit the error. I've enclosed a code snippet below:

import numpy as np
from scipy import ndimage  # ndimage.imread is only available in older SciPy releases

x_train = np.array([np.array(ndimage.imread(image)) for image in image_list])
x_train = x_train.astype(np.float32)

Error traceback:

x_train = x_train.astype(np.float32)
numpy.core._exceptions.MemoryError: Unable to allocate 134. GiB for an array
with shape (2512019, 82, 175, 1) and data type float32

How can I fix this issue without increasing the RAM size? Is there a better way to read the data, for example using a cache or protobuf?

Asked By: steve


Answers:

I would load the first half of the dataset, train the model on it, then load the second half and continue training on that part. This does not influence the result.

The easiest way to split your dataset is to simply make a second folder with the same structure containing 50% of the files.

The pseudo-code for that method of training would look like this:

  1. load dataset 1
  2. train the model with dataset 1
  3. load dataset 2 into the same variable as the first one, so the memory is reused instead of holding both halves at once
  4. train the model with dataset 2

A second option to decrease the memory footprint of your array is to use np.float16 instead of np.float32, but this can result in a less accurate model. How large the difference is depends on the data; it could be 1-2% or even 5-10%, so the only option that loses no accuracy is the one described above.
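A minimal sketch of that float16 variant, reusing the image_list from the question (the cast happens at load time, so no separate float32 copy is created):

import numpy as np
from scipy import ndimage

# float16 halves the footprint: roughly 67 GiB instead of 134 GiB
# for an array of shape (2512019, 82, 175, 1).
x_train = np.array([ndimage.imread(image) for image in image_list],
                   dtype=np.float16)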

EDIT

I am going to add the actual code.

import os
import cv2            # pip install opencv-python
import numpy as np

# os.listdir returns bare filenames, so join them with the folder path
part1_dir = "Path_to_your_first_dataset"
part2_dir = "Path_to_your_second_dataset"
part1_of_dataset = [os.path.join(part1_dir, f) for f in os.listdir(part1_dir)]
part2_of_dataset = [os.path.join(part2_dir, f) for f in os.listdir(part2_dir)]

# First half: load, normalize, train
x_train = np.array([cv2.imread(image) for image in part1_of_dataset], dtype=np.float32)
x_train /= 255.0
x_train_m = np.mean(x_train, axis=0)
x_train -= x_train_m

model.fit(x_train, y_train)  # not the full training code, just an example

# Second half: reuse the same variable so the first half's memory is freed
x_train = np.array([cv2.imread(image) for image in part2_of_dataset], dtype=np.float32)
x_train /= 255.0
x_train_m = np.mean(x_train, axis=0)
x_train -= x_train_m

model.fit(x_train, y_train)  # not the full training code, just an example

This question comes up just as I put my first two 32 GB RAM sticks into my PC today, for pretty much the same reason.

At this point it becomes necessary to handle the data differently.

I am not sure what you are using to do the learning, but if it's TensorFlow you can customize your input pipeline; a sketch follows below.
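A minimal sketch of such a pipeline with tf.data, assuming the images are grayscale PNGs whose paths are in image_paths and whose labels are in a matching labels list (both hypothetical names); files are read and decoded lazily per batch, so the full array never sits in RAM:

import tensorflow as tf

# image_paths: list of file paths, labels: matching list of integer labels
# (hypothetical names; build them however your data is organized)
def load_image(path, label):
    img = tf.io.read_file(path)                          # read one file lazily
    img = tf.io.decode_png(img, channels=1)              # grayscale, e.g. (82, 175, 1)
    img = tf.image.convert_image_dtype(img, tf.float32)  # scales to [0, 1]
    return img, label

dataset = (tf.data.Dataset.from_tensor_slices((image_paths, labels))
           .shuffle(10_000)
           .map(load_image, num_parallel_calls=tf.data.AUTOTUNE)
           .batch(64)
           .prefetch(tf.data.AUTOTUNE))  # prepare the next batch while training

model.fit(dataset, epochs=10)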

Anyway, it comes down to correctly analyzing what you want to do with the data and the capabilities of your environment. If the data is ready to train and you just load it from disk, it should not be a problem to load and train on only a portion of it, then move to the next portion, and so on.

You can split the data into multiple files or load it partially (there are data types/file formats that help with that, such as memory-mapped arrays or HDF5). You can even optimize this to the point where you read from disk during training and have the next batch ready to go when you need it.
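As one concrete illustration of partial loading, here is a minimal sketch using np.memmap, assuming the preprocessed float32 images were written once to a flat binary file named "x_train.dat" (a hypothetical filename) with the shape from the error message; only the slices you index get paged into RAM:

import numpy as np

# Hypothetical file written once beforehand, e.g. with x_train.tofile("x_train.dat")
N, H, W, C = 2512019, 82, 175, 1   # shape taken from the error traceback
x_all = np.memmap("x_train.dat", dtype=np.float32, mode="r", shape=(N, H, W, C))

batch_size = 256
for start in range(0, N, batch_size):
    # Only this slice is copied into RAM; the rest stays on disk.
    x_batch = np.asarray(x_all[start:start + batch_size])
    # model.train_on_batch(x_batch, y_batch)  # hypothetical per-batch training call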

Answered By: t0b4cc0