How to merge cells of a dataframe by summation

Question:

I want to transform my dataframe by merging its cells and summing them into larger cells, given the boundary indices of those blocks. For example, given the indices [0,2] and [2,4] on both the X and Y axes, I want to go from the following dataframe:

+----+----+----+----+
| 1  | 2  | 3  | 4  |
+----+----+----+----+
| 5  | 6  | 7  | 8  |
+----+----+----+----+
| 9  | 10 | 11 | 12 |
+----+----+----+----+
| 13 | 14 | 15 | 16 |
+----+----+----+----+

to the following one:

+----+----+
| 14 | 22 |
+----+----+
| 46 | 54 |
+----+----+

I was thinking Pandas' groupby.transform or rolling might help.
Any clues?

Asked By: marc nicole


Answers:

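import pandas as pd

def fn(x):
    # Relabel rows and columns so each 2x2 block shares one label
    # (assumes a square frame with a default integer index)
    x.index = x.columns = x.index//2
    # Stack to long form, sum per (row block, column block), unstack to a grid
    y = x.stack().groupby(level = [0,1]).sum().unstack()
    y.index.name = y.columns.name = None
    return y

df = pd.DataFrame({0 : [1, 5, 9, 13], 1 : [2, 6, 10, 14],
              2 : [3, 7, 11, 15], 3 : [4, 8, 12, 16]})
# Pass a copy because fn relabels its argument in place
fn(df.copy())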
    0   1
0  14  22
1  46  54
Answered By: Onyambu

Assuming you have homogeneous blocks (e.g., 2×2), the most efficient approach is
to reshape the underlying array and sum:

N = 2

out = pd.DataFrame(df.to_numpy()
                     # convert the 2D array to 4D
                     .reshape(len(df)//N, N, -1, N)
                     # sum along dimensions 1 and 3 to go back to 2D
                     .sum((1, 3))
                   )

If you want non-square blocks (RxC):

R, C = 2, 2

out = pd.DataFrame(df.to_numpy()
                     .reshape(len(df)//R, R, df.shape[1]//C, C)
                     .sum((1, 3))
                   )

Output:

    0   1
0  14  22
1  46  54

Intermediate 4D array:

# df.to_numpy().reshape((len(df)//N, N, -1, N))

array([[[[ 1,  2],    # ──┐
         [ 3,  4]],   # ─┐├─> 1+2+5+6 = 14
                      #  ││
        [[ 5,  6],    # ──┘
         [ 7,  8]]],  # ─┴──> 3+4+7+8 = 22


       [[[ 9, 10],    # ──┐
         [11, 12]],   # ─┐├─>  9+10+13+14 = 46
                      #  ││
        [[13, 14],    # ──┘
         [15, 16]]]]) # ─┴──> 11+12+15+16 = 54
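
The reshape trick needs every block to have the same size. When the blocks are instead described by their start indices, as in the question, one possible alternative is np.add.reduceat, applied once per axis. This is only a sketch, not part of the original answer: the names row_starts and col_starts are hypothetical, and the start indices are assumed to be strictly increasing and to begin at 0.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[1, 2, 3, 4],
                   [5, 6, 7, 8],
                   [9, 10, 11, 12],
                   [13, 14, 15, 16]])

# Hypothetical start offsets of each block along the rows and the columns;
# each block runs from its start up to the next start (or the end of the axis)
row_starts = [0, 2]
col_starts = [0, 2]

arr = df.to_numpy()
# Sum the rows of each row block, then the columns of each column block
row_sums = np.add.reduceat(arr, row_starts, axis=0)
block_sums = np.add.reduceat(row_sums, col_starts, axis=1)

out = pd.DataFrame(block_sums)
print(out)
```

With uneven starts such as row_starts = [0, 1] and col_starts = [0, 3], the same two calls would produce blocks of different sizes, which the reshape cannot express.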
Answered By: mozway

Using a list comprehension combined with np.array_split, which is faster than an explicit for loop:

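import numpy as np
import pandas as pd

data = [[1, 2, 3, 4],
        [5, 6, 7, 8],
        [9, 10, 11, 12],
        [13, 14, 15, 16]]

df = pd.DataFrame(data)

# Number of blocks along each axis (2x2 blocks on a square frame)
n_blocks = len(df)//2

# Split the rows into groups, split each group column-wise, sum every block,
# then reshape the flat list of sums into an n_blocks x n_blocks array.
# Splitting df.to_numpy() avoids relying on np.array_split handling DataFrames.
arr = np.array([block.sum()
                for rows in np.array_split(df.to_numpy(), n_blocks)
                for block in np.array_split(rows, n_blocks, axis=1)]
               ).reshape((n_blocks, n_blocks))

# Convert the array to a DataFrame
result = pd.DataFrame(arr, columns=['col1', 'col2'])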

print(result)
   col1  col2
0    14    22
1    46    54

Timings

Tested with time_it:

  • Laurent_B:
Script execution time: 0.004486 seconds
  • Mozway:
Script execution time: 0.000242 seconds
  • Onyambu:
Script execution time: 0.004269 seconds

In conclusion, Mozway's script is the fastest (roughly 18 times faster than the other two on this input) because it uses pure NumPy operations, although the other approaches work too.
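
For anyone who wants to reproduce this kind of comparison, here is a minimal sketch using the standard-library timeit module. It is not the script that produced the numbers above: the function names by_array_split and by_reshape and the repeat count are assumptions, and only two of the approaches are wrapped. Absolute timings will vary with the machine and library versions.

```python
import timeit

import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(1, 17).reshape(4, 4))
N = 2  # block size along each axis

def by_array_split(frame):
    # List-comprehension approach: split into blocks, sum each block
    arr = frame.to_numpy()
    n_rows, n_cols = arr.shape[0] // N, arr.shape[1] // N
    return pd.DataFrame([[block.sum()
                          for block in np.array_split(rows, n_cols, axis=1)]
                         for rows in np.array_split(arr, n_rows)])

def by_reshape(frame):
    # Reshape approach: view the 2D array as 4D and sum the block axes
    arr = frame.to_numpy()
    return pd.DataFrame(arr.reshape(len(frame) // N, N, -1, N).sum((1, 3)))

REPEATS = 100  # hypothetical repeat count, chosen only to keep the run short

for fn in (by_array_split, by_reshape):
    total = timeit.timeit(lambda: fn(df), number=REPEATS)
    print(f"{fn.__name__}: {total / REPEATS:.6f} seconds per call")
```

On a 4x4 frame most of the time goes to fixed overhead, so a larger frame would likely give a more representative comparison.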

Answered By: Laurent B.