How to read a large sample of images effectively without overloading RAM?
Question:
While training a classification model I pass the input image samples as a NumPy array, but when I try to train on a large dataset I run into a memory error. I currently have 120 GB of memory, and even with that much I hit the error. I've enclosed a code snippet below:
x_train = np.array([np.array(ndimage.imread(image)) for image in image_list])
x_train = x_train.astype(np.float32)
Error traceback:
x_train = x_train.astype(np.float32)
numpy.core._exceptions.MemoryError: Unable to allocate 134. GiB for an array
with shape (2512019, 82, 175, 1) and data type float32
How can I fix this issue without increasing the RAM size? Is there a better way to read the data, for example using a cache or protobuf?
Answers:
I would load the first half of the dataset, train the model on it, then load the second half and train on that. This does not influence the result.
The easiest way to split your dataset is to simply make a 2nd folder with the same structure with 50% of the dataset.
The pseudo-code for that method of training would look like this:
- load dataset 1
- train the model with dataset 1
- load dataset 2 into the same variable as the first one, so the memory is reused instead of a second variable being created while the first is still in memory
- train the model with dataset 2
A second option to decrease the memory size of your array is to use np.float16 instead of np.float32, but this would result in a less accurate model. The difference is data-dependent, so it could be 1-2% or even 5-10%; the only option that loses no accuracy is the one described above.
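As a quick illustration of the float16 option, here is a small stand-in array (the shape is scaled down from the one in the question) showing the memory saving:

```python
import numpy as np

# Small stand-in batch; the real array in the question has shape
# (2512019, 82, 175, 1) and needs ~134 GiB as float32.
batch = np.zeros((1000, 82, 175, 1), dtype=np.float32)
half_precision = batch.astype(np.float16)

print(batch.nbytes)           # 57400000 bytes
print(half_precision.nbytes)  # 28700000 bytes: exactly half
```

Halving the element size halves the array's footprint, at the cost of the reduced precision discussed above.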
EDIT
Here is the actual code:
import os
import cv2  # pip install opencv-python
import numpy as np

part1_dir = "Path_to_your_first_dataset"
part2_dir = "Path_to_your_second_dataset"
# os.listdir returns bare filenames, so join them with the directory path
part1_of_dataset = [os.path.join(part1_dir, f) for f in os.listdir(part1_dir)]
part2_of_dataset = [os.path.join(part2_dir, f) for f in os.listdir(part2_dir)]

# First half: load, scale to [0, 1], center, train.
x_train = np.array([cv2.imread(image) for image in part1_of_dataset], dtype=np.float32)
x_train /= 255.0
x_train_m = np.mean(x_train, axis=0)
x_train -= x_train_m
model.fit(x_train, y_train)  # not the full training code, just an example

# Second half: reuse the same variable so the first half can be freed.
x_train = np.array([cv2.imread(image) for image in part2_of_dataset], dtype=np.float32)
x_train /= 255.0
x_train_m = np.mean(x_train, axis=0)
x_train -= x_train_m
model.fit(x_train, y_train)  # not the full training code, just an example
This question comes up just as I put the first two 32 GB RAM sticks into my PC today, for pretty much the same reason.
At this point it becomes necessary to handle the data differently.
I am not sure what framework you are using for the training, but if it's TensorFlow you can customize your input pipeline.
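The same idea can be sketched framework-agnostically with a plain Python generator that loads only one batch from disk at a time; Keras-style model.fit can consume such a generator directly. All names here are assumptions for illustration, not the question's code:

```python
import numpy as np

def batch_generator(paths, labels, batch_size, load_fn):
    """Yield (x, y) batches, keeping only batch_size images in RAM at once.

    load_fn is whatever reads one image as a NumPy array (e.g. a wrapper
    around cv2.imread); it is a parameter so the sketch stays library-agnostic.
    """
    while True:  # Keras-style generators loop forever; fit() stops per epoch
        for start in range(0, len(paths), batch_size):
            chunk = paths[start:start + batch_size]
            x = np.stack([load_fn(p) for p in chunk]).astype(np.float32) / 255.0
            y = labels[start:start + batch_size]
            yield x, y

# With Keras you would then train roughly like:
# model.fit(batch_generator(paths, labels, 32, load_fn),
#           steps_per_epoch=len(paths) // 32)
```

In TensorFlow itself, tf.data.Dataset offers the same pattern with built-in prefetching and parallel loading.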
Anyway, it comes down to correctly analyzing what you want to do with the data and the capabilities of your environment. If the data is ready for training and you just load it from disk, it should not be a problem to load and train on only a portion of it, then move on to the next portion, and so on.
You can split the data into multiple files or load it partially (there are data types and file formats to help with that, such as memory-mapped arrays). You can even optimize this to the point where you read from disk during training and have the next batch ready to go when you need it.
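One example of such a format is a raw memory-mapped array via np.memmap: slices are read from disk on demand instead of the whole array being materialized in RAM. The file path and shape below are assumptions for illustration:

```python
import os
import tempfile
import numpy as np

# Write the array once to a raw file on disk (hypothetical path).
path = os.path.join(tempfile.mkdtemp(), "x_train.dat")
full = np.memmap(path, dtype=np.float32, mode="w+", shape=(1000, 82, 175, 1))
full[:] = 1.0
full.flush()

# Later (even in another process): open read-only and slice lazily.
view = np.memmap(path, dtype=np.float32, mode="r", shape=(1000, 82, 175, 1))
portion = np.array(view[0:100])  # only this slice is copied into RAM
```

Training can then iterate over such slices one portion at a time, exactly as described above.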