How to make a multidimension numpy array with a varying row size?

Question:

I would like to create a two dimensional numpy array of arrays that has a different number of elements on each row.

Trying

cells = numpy.array([[0,1,2,3], [2,3,4]])

gives an error

ValueError: setting an array element with a sequence.
Asked By: D R

||

Answers:

While Numpy knows about arrays of arbitrary objects, it’s optimized for homogeneous arrays of numbers with fixed dimensions. If you really need arrays of arrays, better use a nested list. But depending on the intended use of your data, different data structures might be even better, e.g. a masked array if you have some invalid data points.

If you really want flexible Numpy arrays, use something like this:

numpy.array([[0,1,2,3], [2,3,4]], dtype=object)

However this will create a one-dimensional array that stores references to lists, which means that you will lose most of the benefits of Numpy (vector processing, locality, slicing, etc.).

Answered By: Philipp

This isn’t well supported in Numpy (by definition, almost everywhere, a “two dimensional array” has all rows of equal length). A Python list of Numpy arrays may be a good solution for you, as this way you’ll get the advantages of Numpy where you can use them:

cells = [numpy.array(a) for a in [[0,1,2,3], [2,3,4]]]
Answered By: tom10

We are now almost 7 years after the question was asked, and your code

cells = numpy.array([[0,1,2,3], [2,3,4]])

executed in numpy 1.12.0, python 3.5, doesn’t produce any error and
cells contains:

array([[0, 1, 2, 3], [2, 3, 4]], dtype=object)

You access your cells elements as cells[0][2] # (=2) .

An alternative to tom10’s solution if you want to build your list of numpy arrays on the fly as new elements (i.e. arrays) become available is to use append:

d = []                 # initialize an empty list
a = np.arange(3)       # array([0, 1, 2])
d.append(a)            # [array([0, 1, 2])]
b = np.arange(3,-1,-1) #array([3, 2, 1, 0])
d.append(b)            #[array([0, 1, 2]), array([3, 2, 1, 0])]
Answered By: calocedrus

Another option would be to store your arrays as one contiguous array and also store their sizes or offsets. This takes a little more conceptual thought around how to operate on your arrays, but a surprisingly large number of operations can be made to work as if you had a two dimensional array with different sizes. In the cases where they can’t, then np.split can be used to create the list that calocedrus recommends. The easiest operations are ufuncs, because they require almost no modification. Here are some examples:

cells_flat = numpy.array([0, 1, 2, 3, 2, 3, 4])
# One of these is required, it's pretty easy to convert between them,
# but having both makes the examples easy
cell_lengths = numpy.array([4, 3])
cell_starts = numpy.insert(cell_lengths[:-1].cumsum(), 0, 0)
cell_lengths2 = numpy.diff(numpy.append(cell_starts, cells_flat.size))
assert np.all(cell_lengths == cell_lengths2)

# Copy prevents shared memory
cells = numpy.split(cells_flat.copy(), cell_starts[1:])
# [array([0, 1, 2, 3]), array([2, 3, 4])]

numpy.array([x.sum() for x in cells])
# array([6, 9])
numpy.add.reduceat(cells_flat, cell_starts)
# array([6, 9])

[a + v for a, v in zip(cells, [1, 3])]
# [array([1, 2, 3, 4]), array([5, 6, 7])]
cells_flat + numpy.repeat([1, 3], cell_lengths)
# array([1, 2, 3, 4, 5, 6, 7])

[a.astype(float) / a.sum() for a in cells]
# [array([ 0.        ,  0.16666667,  0.33333333,  0.5       ]),
#  array([ 0.22222222,  0.33333333,  0.44444444])]
cells_flat.astype(float) / np.add.reduceat(cells_flat, cell_starts).repeat(cell_lengths)
# array([ 0.        ,  0.16666667,  0.33333333,  0.5       ,  0.22222222,
#         0.33333333,  0.44444444])

def complex_modify(array):
    """Some complicated function that modifies array

    pretend this is more complex than it is"""
    array *= 3

for arr in cells:
    complex_modify(arr)
cells
# [array([0, 3, 6, 9]), array([ 6,  9, 12])]
for arr in numpy.split(cells_flat, cell_starts[1:]):
    complex_modify(arr)
cells_flat
# array([ 0,  3,  6,  9,  6,  9, 12])
Answered By: Erik

In numpy 1.14.3, using append:

d = []                 # initialize an empty list
a = np.arange(3)       # array([0, 1, 2])
d.append(a)            # [array([0, 1, 2])]
b = np.arange(3,-1,-1) #array([3, 2, 1, 0])
d.append(b)            #[array([0, 1, 2]), array([3, 2, 1, 0])]

what you get an list of arrays (that can be of different lengths) and you can do operations like d[0].mean(). On the other hand,

cells = numpy.array([[0,1,2,3], [2,3,4]])

results in an array of lists.

You may want to do this:

a1 = np.array([1,2,3])
a2 = np.array([3,4])
a3 = np.array([a1,a2])
a3 # array([array([1, 2, 3]), array([3, 4])], dtype=object)
type(a3) # numpy.ndarray
type(a2) # numpy.ndarray

Slightly off-topic, but not as much as one would think because of eager mode which is now the default:
If you are using Tensorflow, you can do:

a = tf.ragged.constant([[0, 1, 2, 3]])
b = tf.ragged.constant([[2, 3, 4]])
c = tf.concat([a, b], axis=0)

And you can then do all the mathematical operations still, like tf.math.reduce_mean, etc.

np.array([[0,1,2,3], [2,3,4]], dtype=object) returns an "array" of lists.

a = np.array([np.array([0,1,2,3]), np.array([2,3,4])], dtype=object) returns an array of arrays. It allows already for operations such as a+1.

Building up on this, the functionality can be enhanced by subclassing.

import numpy as np

class Arrays(np.ndarray):
    def __new__(cls, input_array, dims=None):
        obj = np.array(list(map(np.array, input_array))).view(cls)
        return obj
    def __getitem__(self, ij):
        if isinstance(ij, tuple) and len(ij) > 1:
            # handle twodimensional slicing
            if isinstance(ij[0],slice) or hasattr(ij[0], '__iter__'):
                # [1:4,:] or [[1,2,3],[1,2]]
                return Arrays(arr[ij[1]] for arr in self[ij[0]])
            return self[ij[0]][ij[1]] # [1,:] np.array
        return super(Arrays, self).__getitem__(ij)
    def __array_ufunc__(self, ufunc, method, *inputs, **kwargs):
        axis = kwargs.pop('axis', None)
        dimk = [len(arg) if hasattr(arg, '__iter__') else 1 for arg in inputs]
        dim = max(dimk)
        pad_inputs = [([i]*dim if (d<dim) else i) for d,i in zip(dimk, inputs)]
        result = [np.ndarray.__array_ufunc__(self, ufunc, method, *x, **kwargs) for x in zip(*pad_inputs)]
        if method == 'reduce':
            # handle sum, min, max, etc.
            if axis == 1:
                return np.array(result)
            else:
                # repeat over remaining axis
                return np.ndarray.__array_ufunc__(self, ufunc, method, result, **kwargs)
        return Arrays(result)

Now this works:

a = Arrays([[0,1,2,3], [2,3,4]])
a[0:1,0:-1]
# Arrays([[0, 1, 2]])
np.sin(a)
# Arrays([array([0.        , 0.84147098, 0.90929743, 0.14112001]),
#        array([ 0.90929743,  0.14112001, -0.7568025 ])], dtype=object)
a + 2*a
# Arrays([array([0, 3, 6, 9]), array([ 6,  9, 12])], dtype=object)

To get nanfunctions working, this can be done

# patch for nanfunction that cannot handle the object-ndarrays along with second axis=-1
def nanpatch(func):
    def wrapper(a, axis=None, **kwargs):
        if isinstance(a, Arrays):
            rowresult = [func(x, **kwargs) for x in a]
            if axis == 1:
                return np.array(rowresult)
            else:
                # repeat over remaining axis
                return func(rowresult)
        # otherwise keep the original version
        return func(a, axis=axis, **kwargs)
    return wrapper

np.nanmean = nanpatch(np.nanmean)
np.nansum = nanpatch(np.nansum)
np.nanmin = nanpatch(np.nanmin)
np.nanmax = nanpatch(np.nanmax)
np.nansum(a)
# 15
np.nansum(a, axis=1)
# array([6, 9])

If anyone is still looking for this 13 years after the original post: Awkward arrays appear to be just what you’re looking for: https://awkward-array.org/doc/main/
They are an implementation of arrays with varying row size with numpy-like syntax and performance, originally motivated by needs of high-energy-physics analysis.

Answered By: Jan
Categories: questions Tags: , ,
Answers are sorted by their score. The answer accepted by the question owner as the best is marked with
at the top-right corner.