Calculate the mean over a mixed data structure

Question:

I have a list of lists that looks something like:

data = [
    [1., np.array([2., 3., 4.]), ...],
    [5., np.array([6., 7., 8.]), ...],
    ...
]

where each of the internal lists is the same length and contains the same data type/shape at each entry. I would like to calculate the mean over corresponding entries and return something of the same structure as the internal lists. For example, in the above case (assuming only the two rows shown) I want the result to be:

[3., np.array([4., 5., 6.]), ...]

What is the best way to do this with Python?

Asked By: Mead


Answers:

If you have a list structured exactly like the one in the example, you can do it with the following code.
First we declare some variables to store our results:

import numpy as np

number_sum = 0.
list_sum = np.array([0., 0., 0.])  # float zeros, so the sums are not truncated

It is important to initialize list_sum with float zeros of the right length. That is, if the arrays in data contain 5 elements each, it should be list_sum = np.array([0., 0., 0., 0., 0.]) (or simply np.zeros(5)); integer zeros would silently truncate the float sums.

The next step is to sum all the elements in data. First we add the scalar values, and then we add each element of the arrays, as follows:

for number, nparray in data:       # each row is [scalar, array]
    number_sum += number           # accumulate the scalar entries
    for index, item in enumerate(nparray):
        list_sum[index] += item    # accumulate the array entries element-wise

Since we know how the variable data is structured (each row is made up of a scalar and an np.array), we can do the addition this way. Be careful with the computational cost, though: the two nested for loops make this expensive for longer arrays. A vectorized alternative is sketched below.
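As a side note, NumPy already adds arrays element-wise, so the inner loop is not strictly needed. A minimal, self-contained sketch of that idea, assuming the same two-entry rows as in the question:

import numpy as np

data = [
    [1., np.array([2., 3., 4.])],
    [5., np.array([6., 7., 8.])],
]

number_sum = 0.
list_sum = np.zeros(3)        # float zeros, same length as the inner arrays

for number, nparray in data:
    number_sum += number      # accumulate the scalars
    list_sum += nparray       # vectorized, element-wise accumulation
# number_sum is now 6.0 and list_sum is array([ 8., 10., 12.])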

Finally, you can check that dividing the sums by the length of data gives the desired values:

print(number_sum/len(data))
print(list_sum/len(data))

Now you just have to collect those two values in a new list, as sketched below. I hope it helps, greetings and good luck!
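For example, assuming the variables accumulated above, assembling the result could look like this:

result = [number_sum / len(data), list_sum / len(data)]
print(result)    # [3.0, array([4., 5., 6.])]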

Answered By: a-Alarcon

The following works:

import numpy as np

data = [
    [1., np.array([2., 3., 4.]), np.array([[1., 1.], [1., 1.]])],
    [5., np.array([6., 7., 8.]), np.array([[3., 3.], [3., 3.]])],
]

number_of_samples = len(data)
number_of_elements = len(data[0])
means = []
for ielement in range(number_of_elements):
    # collect the ielement-th entry from every sample
    mean_list = []
    for isample in range(number_of_samples):
        mean_list.append(data[isample][ielement])
    # stack into a single array and average over the sample axis
    mean_list = np.stack(mean_list)
    mean = mean_list.mean(axis=0)
    means.append(mean)
print(means)

but it is a bit ugly, nests for loops, and does not seem very Pythonic. Any improvements over this are welcome.

Answered By: Mead

data is a list, so a list comprehension seems like a natural option. Even if you wrapped data in a numpy array, it would be a jagged (object-dtype) array and wouldn't benefit from the conversion, so a list comprehension would still be the best option, in my opinion.

Anyway, use zip() to "transpose" data and call np.mean() on each group to find the mean along the first axis.

[np.mean(x, axis=0) for x in zip(*data)]
# [3.0, array([4., 5., 6.]), array([[2., 2.], [2., 2.]])]
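
For clarity, zip(*data) groups corresponding entries across the rows, so each x in the comprehension is a tuple that np.mean() can reduce along its first axis. A quick sketch with the same data as in the self-answer above (output comments shown compacted):

import numpy as np

data = [
    [1., np.array([2., 3., 4.]), np.array([[1., 1.], [1., 1.]])],
    [5., np.array([6., 7., 8.]), np.array([[3., 3.], [3., 3.]])],
]

for group in zip(*data):
    print(group)
# (1.0, 5.0)
# (array([2., 3., 4.]), array([6., 7., 8.]))
# (array([[1., 1.], [1., 1.]]), array([[3., 3.], [3., 3.]]))

means = [np.mean(x, axis=0) for x in zip(*data)]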