Calculate mean of large numpy array which is memmapped from hdf5 file
Question:
I have a problem calculating the mean of an array in NumPy that is too large for RAM (~100 GB).
I've looked into using np.memmap, but unfortunately my array is stored as a dataset in an HDF5 file, and based on what I've tried, np.memmap doesn't accept HDF5 datasets as input:
TypeError: coercing to Unicode: need string or buffer, Dataset found
So how can I call np.mean on a memory-mapped array from disk in an efficient way? Of course I could loop over parts of the dataset, where each part fits into memory.
However, this feels like a workaround, and I'm also not sure it would achieve the best performance.
Here’s some sample code:
import h5py
import numpy as np

data = np.random.randint(0, 255, 100000*10*10*10, dtype=np.uint8)
data = data.reshape((100000, 10, 10, 10))  # typically a lot larger, ~100 GB
hdf5_file = h5py.File('data.h5', 'w')
hdf5_file.create_dataset('x', data=data, dtype='uint8')
hdf5_file.close()
def get_mean_image(filepath):
    """
    Returns the mean array of a dataset.
    """
    f = h5py.File(filepath, "r")
    xs_mean = np.mean(f['x'], axis=0)  # MemoryError with a large enough array
    return xs_mean
xs_mean = get_mean_image('./data.h5')
Answers:
As hpaulj suggested in the comments, I just split the mean calculation into multiple steps.
Here's my (simplified) code, in case it's useful to someone:
import os
import h5py
import numpy as np
import psutil

def get_mean_image(filepath):
    """
    Returns the mean_image of an xs dataset.
    :param str filepath: Filepath of the data upon which the mean_image should be calculated.
    :return: ndarray xs_mean: mean_image of the x dataset.
    """
    f = h5py.File(filepath, "r")
    # Check available memory and divide the mean calculation into steps
    total_memory = 0.5 * psutil.virtual_memory().available  # in bytes; take 1/2 of what is available, just to be safe
    filesize = os.path.getsize(filepath)
    steps = int(np.ceil(filesize / total_memory))
    n_rows = f['x'].shape[0]
    stepsize = int(n_rows / float(steps))
    xs_mean_arr = None
    for i in range(steps):
        if xs_mean_arr is None:  # create xs_mean_arr that stores the intermediate mean_temp results
            xs_mean_arr = np.zeros((steps,) + f['x'].shape[1:], dtype=np.float64)
        if i == steps - 1:  # for the last step, calculate the mean up to the end of the file
            xs_mean_temp = np.mean(f['x'][i * stepsize:n_rows], axis=0, dtype=np.float64)
        else:
            xs_mean_temp = np.mean(f['x'][i * stepsize:(i + 1) * stepsize], axis=0, dtype=np.float64)
        xs_mean_arr[i] = xs_mean_temp
    xs_mean = np.mean(xs_mean_arr, axis=0, dtype=np.float64).astype(np.float32)
    return xs_mean
A better mean calculation algorithm would be:
import math
import numpy as np

N = x.shape[0]
batch_size = 32
num_steps = math.ceil(N / batch_size)
mean = np.zeros(x.shape[1:])
for i in range(num_steps):
    x_batch = x[i * batch_size:(i + 1) * batch_size]
    curr_batch_size = x_batch.shape[0]
    mean += x_batch.mean(0) * curr_batch_size / N
    # alternatively: mean += x_batch.sum(0) / N
Basically, (a_1 + a_2 + ... + a_N) / N = a_1 / N + a_2 / N + ... + a_N / N.
This is more precise than computing a mean of means (which gives a slightly wrong result when the last batch has a different size), and it also doesn't have the memory overhead of storing the means of all the chunks, since all you're doing is a running sum reduction.
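The running-sum idea above can be sketched as a small helper that works on anything sliceable along axis 0, including an h5py dataset (since each slice reads only that chunk from disk). The function name `streaming_mean` and its `batch_size` parameter are names chosen here for illustration, not from the original answer:

```python
import numpy as np

def streaming_mean(x, batch_size=1024):
    """Mean over axis 0 of an array-like (e.g. an h5py dataset),
    computed as a running-sum reduction so only one batch is ever in RAM."""
    n = x.shape[0]
    mean = np.zeros(x.shape[1:], dtype=np.float64)
    for start in range(0, n, batch_size):
        batch = x[start:start + batch_size]  # an h5py dataset reads only this slice
        mean += batch.sum(axis=0, dtype=np.float64) / n
    return mean
```

Accumulating in float64 and dividing each batch sum by the full `n` avoids both overflow for uint8 data and the unequal-last-batch bias of a mean of means.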