Element wise mean of numpy arrays of different sizes

Question:

So there is a csv file I’m reading where I’m focusing on col3 where the rows have the values of different lengths where initially it was being read as a type str but was fixed using pd.eval.

``````df = pd.read_csv('datafile.csv', converters={'col3': pd.eval})

row e.g. [0, 100, -200, 300, -150...]
``````

There are many rows of different sizes and I want to calculate the element wise average, where I have followed this solution.
I first ran into the Numpy VisibleDeprecationWarning error which I fixed using this.
But for the last step of the solution using np.nanmean I’m running into a new error which is

ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()

My code looks like this so far:

``````import pandas as pd
import numpy as np
import itertools

datafile = df[(df['col1'] == 'Red') & (df['col2'] == Name) & ((df['col4'] == 'EX') | (df['col5'] == 'EX'))]

np.warnings.filterwarnings('ignore', category=np.VisibleDeprecationWarning)
ar = np.array(list(itertools.zip_longest(df['col3'], fillvalue=np.nan)))
print(ar)
np.nanmean(ar,axis=1)
``````

the arrays print like this

And the error is pointing towards the last line

The error I can see if pointing towards the arrays being of type object but I’m not sure how to fix it.

Make a ragged array:

``````In [23]: arr = np.array([np.arange(5), np.ones(5),np.zeros(3)],object)
In [24]: arr
Out[24]:
array([array([0, 1, 2, 3, 4]), array([1., 1., 1., 1., 1.]),
array([0., 0., 0.])], dtype=object)
``````

Note the shape and dtype.

Try to use mean on it:

``````In [25]: np.mean(arr)
Traceback (most recent call last):
Input In [25] in <cell line: 1>
np.mean(arr)
File <__array_function__ internals>:180 in mean
File /usr/local/lib/python3.10/dist-packages/numpy/core/fromnumeric.py:3432 in mean
return _methods._mean(a, axis=axis, dtype=dtype,
File /usr/local/lib/python3.10/dist-packages/numpy/core/_methods.py:180 in _mean
ret = umr_sum(arr, axis, dtype, out, keepdims, where=where)
ValueError: operands could not be broadcast together with shapes (5,) (3,)
``````

Apply mean to each element array works:

``````In [26]: [np.mean(a) for a in arr]
Out[26]: [2.0, 1.0, 0.0]
``````

Trying to use zip_longest:

``````In [27]: import itertools
In [28]: list(itertools.zip_longest(arr))
Out[28]:
[(array([0, 1, 2, 3, 4]),),
(array([1., 1., 1., 1., 1.]),),
(array([0., 0., 0.]),)]
``````

No change. We can use it by unpacking the arr – but it has padded the arrays in the wrong way:

``````In [29]: list(itertools.zip_longest(*arr))
Out[29]: [(0, 1.0, 0.0), (1, 1.0, 0.0), (2, 1.0, 0.0), (3, 1.0, None), (4, 1.0, None)]
``````

`zip_longest` can be used to pad lists, but it takes more thought than this.

If we make an array from that list:

``````In [35]: np.array(list(itertools.zip_longest(*arr,fillvalue=np.nan)))
Out[35]:
array([[ 0.,  1.,  0.],
[ 1.,  1.,  0.],
[ 2.,  1.,  0.],
[ 3.,  1., nan],
[ 4.,  1., nan]])
``````

and transpose it, we can take the `nanmean`:

``````In [39]: np.array(list(itertools.zip_longest(*arr,fillvalue=np.nan))).T
Out[39]:
array([[ 0.,  1.,  2.,  3.,  4.],
[ 1.,  1.,  1.,  1.,  1.],
[ 0.,  0.,  0., nan, nan]])
In [40]: np.nanmean(_, axis=1)
Out[40]: array([2., 1., 0.])
``````
Categories: questions Tags: , ,
Answers are sorted by their score. The answer accepted by the question owner as the best is marked with
at the top-right corner.