How to merge cells of a DataFrame by summation
Question:
I want to transform my DataFrame by merging its cells and summing them into larger cells, given the indices of the blocks. For example, given the index ranges [0,2] and [2,4] on the X and Y axes, I want to go from the following DataFrame:
+----+----+----+----+
| 1 | 2 | 3 | 4 |
+----+----+----+----+
| 5 | 6 | 7 | 8 |
+----+----+----+----+
| 9 | 10 | 11 | 12 |
+----+----+----+----+
| 13 | 14 | 15 | 16 |
+----+----+----+----+
to the following one:
+----+----+
| 14 | 22 |
+----+----+
| 46 | 54 |
+----+----+
I was thinking that pandas' groupby.transform or rolling might help.
Any clues?
Answers:
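import pandas as pd

def fn(x):
    # Relabel rows and columns so that each 2x2 block shares one label pair
    x.index = x.columns = x.index // 2
    # Stack to long form, sum per (row label, column label), then unstack
    y = x.stack().groupby(level=[0, 1]).sum().unstack()
    y.index.name = y.columns.name = None
    return y

df = pd.DataFrame({0: [1, 5, 9, 13], 1: [2, 6, 10, 14],
                   2: [3, 7, 11, 15], 3: [4, 8, 12, 16]})
fn(df.copy())

    0   1
0  14  22
1  46  54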
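The same label-based idea can be extended to rectangular blocks by building separate row and column labels. What follows is only a minimal sketch, assuming blocks of `r` rows by `c` columns; the helper name `block_sum_groupby` is hypothetical and is not part of the answer above.

```python
import numpy as np
import pandas as pd

def block_sum_groupby(df, r, c):
    """Sum r x c blocks of df by grouping rows, then columns, on integer labels.

    Hypothetical helper for illustration only.
    """
    row_labels = np.arange(df.shape[0]) // r   # e.g. [0, 0, 1, 1] for r = 2
    col_labels = np.arange(df.shape[1]) // c
    by_rows = df.groupby(row_labels).sum()          # collapse the rows of each block
    return by_rows.T.groupby(col_labels).sum().T    # collapse the columns the same way

df = pd.DataFrame(np.arange(1, 17).reshape(4, 4))
result = block_sum_groupby(df, 2, 2)
print(result)
```

Because the labels come from integer division, a trailing block that is smaller than `r` by `c` is simply summed as a smaller group instead of raising an error.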
Assuming you have homogeneous blocks (e.g., 2×2), the most efficient approach would be to reshape the underlying NumPy array and sum:
N = 2
out = pd.DataFrame(df.to_numpy()
                     # convert the 2D array to 4D
                     .reshape(len(df)//N, N, -1, N)
                     # sum along dimensions 1 and 3 to go back to 2D
                     .sum((1, 3))
                   )
If you want non-square blocks (RxC):
R, C = 2, 2
out = pd.DataFrame(df.to_numpy()
                     .reshape(len(df)//R, R, df.shape[1]//C, C)
                     .sum((1, 3))
                   )
Output:
    0   1
0  14  22
1  46  54
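The reshape only works when the number of rows is a multiple of R and the number of columns is a multiple of C; otherwise NumPy raises a `ValueError` about the array size. A small sketch that makes this precondition explicit is shown here. The wrapper name `block_sum_reshape` and the 4×6 test frame are assumptions added for illustration, not part of the answer.

```python
import numpy as np
import pandas as pd

def block_sum_reshape(df, R, C):
    """Sum R x C blocks via reshape, checking divisibility first.

    Hypothetical wrapper around the reshape approach shown above.
    """
    n_rows, n_cols = df.shape
    if n_rows % R or n_cols % C:
        raise ValueError(
            f"shape {df.shape} is not divisible into {R}x{C} blocks"
        )
    # 2D -> 4D: (row block, row within block, column block, column within block)
    arr = df.to_numpy().reshape(n_rows // R, R, n_cols // C, C)
    # Summing the two "within block" axes brings the array back to 2D
    return pd.DataFrame(arr.sum(axis=(1, 3)))

# Illustrative 4x6 frame holding 1..24, split into 2x3 blocks
df = pd.DataFrame(np.arange(1, 25).reshape(4, 6))
out = block_sum_reshape(df, 2, 3)
print(out)
```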
Intermediate 4D array:
# df.to_numpy().reshape((len(df)//N, N, -1, N))
array([[[[ 1,  2],      # ──┐
         [ 3,  4]],     # ─┐├─> 1+2+5+6 = 14
                        #  ││
        [[ 5,  6],      # ──┘
         [ 7,  8]]],    # ─┴──> 3+4+7+8 = 22

       [[[ 9, 10],      # ──┐
         [11, 12]],     # ─┐├─> 9+10+13+14 = 46
                        #  ││
        [[13, 14],      # ──┘
         [15, 16]]]])   # ─┴──> 11+12+15+16 = 54
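The question describes the blocks by index ranges ([0,2] and [2,4]), which do not have to be the same size. For that more general case, one possible sketch uses `np.add.reduceat`, which sums consecutive slices that begin at the given start positions. This is a different technique from the reshape above; the function name `block_sum_at` is hypothetical, and the start positions are assumed to be strictly increasing and to begin at 0.

```python
import numpy as np
import pandas as pd

def block_sum_at(df, row_starts, col_starts):
    """Sum blocks whose rows/columns begin at the given start positions.

    Hypothetical helper; assumes strictly increasing starts beginning at 0.
    Each block runs from its start up to the next start (or to the end).
    """
    arr = df.to_numpy()
    arr = np.add.reduceat(arr, row_starts, axis=0)   # merge rows within each block
    arr = np.add.reduceat(arr, col_starts, axis=1)   # merge columns within each block
    return pd.DataFrame(arr)

df = pd.DataFrame(np.arange(1, 17).reshape(4, 4))

# Ranges [0,2] and [2,4] on both axes, as in the question
even = block_sum_at(df, [0, 2], [0, 2])
# Uneven blocks: rows [0,1] and [1,4], columns [0,3] and [3,4]
uneven = block_sum_at(df, [0, 1], [0, 3])
print(even)
print(uneven)
```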
Using a list comprehension combined with np.array_split, which is faster than an explicit for loop:
import numpy as np
import pandas as pd

data = [[1, 2, 3, 4],
        [5, 6, 7, 8],
        [9, 10, 11, 12],
        [13, 14, 15, 16]]
df = pd.DataFrame(data)
# Number of blocks along each axis (2 for a 4x4 frame with 2x2 blocks)
df_len = len(df)//2
# Sum each block, then reshape the flat list of sums into a 2x2 array
arr = np.array([arr2.sum().sum() for arr1 in np.array_split(df, df_len)
                for arr2 in np.array_split(arr1, df_len, axis=1)]
               ).reshape((df_len, df_len))
# Convert the array to a DataFrame
result = pd.DataFrame(arr, columns=['col1', 'col2'])
print(result)

   col1  col2
0    14    22
1    46    54
Timings
Tested with time_it:
- Laurent_B: script execution time: 0.004486 seconds
- Mozway: script execution time: 0.000242 seconds
- Onyambu: script execution time: 0.004269 seconds
In the end, Mozway's script, which relies on pure NumPy operations, is the fastest, although the other approaches work too.
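For anyone who wants to repeat this kind of comparison, here is a rough sketch using the standard-library `timeit` module. It is not the `time_it` helper used for the figures above, whose implementation is not shown. It times the reshape approach against a simple label-based groupby that stands in for the pandas-level approaches; the function names are hypothetical, and absolute numbers will vary by machine and library version.

```python
import timeit

import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(1, 17).reshape(4, 4))

def with_reshape(frame, n=2):
    # Pure NumPy: 2D -> 4D reshape, then sum the within-block axes
    arr = frame.to_numpy().reshape(len(frame) // n, n, -1, n)
    return pd.DataFrame(arr.sum(axis=(1, 3)))

def with_groupby(frame, n=2):
    # pandas-level stand-in: group rows, then columns, by integer labels
    rows = np.arange(frame.shape[0]) // n
    cols = np.arange(frame.shape[1]) // n
    return frame.groupby(rows).sum().T.groupby(cols).sum().T

candidates = {"reshape": with_reshape, "groupby": with_groupby}
repeats = 200  # number of calls per candidate; an arbitrary choice

timings = {}
for name, func in candidates.items():
    total = timeit.timeit(lambda: func(df), number=repeats)
    timings[name] = total / repeats   # average seconds per call

for name, seconds in timings.items():
    print(f"{name}: {seconds:.6f} seconds per call")
```

Averaging over many calls smooths out one-off noise, which matters for functions that run in well under a millisecond.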