Sum value in specific combinations of rows
Question:
I have the following dataframe:
import pandas as pd
import numpy as np
df1 = pd.DataFrame({'Name' : ['Jake', 'Nate', '', 'Alex', '', 'Max', 'Nate', 'Jake'],
'Color' : ['', 'red;blue', 'blue;pink', 'green;blue;red', '', '', 'blue', 'red;yellow'],
'Value_1' : [1211233.419, 4007489.726, 953474.6894, np.NaN, 1761987.704, 222600361, 404419.2243, 606066.067 ],
'Value_2' : [np.NaN, 1509907.457, 4792269.911, 43486.59312, np.NaN, np.NaN, 2066645.251, 60988660.37],
'Value_3' : [1175299.998, np.NaN, 1888559.459, np.NaN, 444689.0177, 405513.0572, 343704.0269, 2948494.383]})
---
Name Color Value_1 Value_2 Value_3
0 Jake 1.211233e+06 NaN 1.175300e+06
1 Nate red;blue 4.007490e+06 1.509907e+06 NaN
2 blue;pink 9.534747e+05 4.792270e+06 1.888559e+06
3 Alex green;blue;red NaN 4.348659e+04 NaN
4 1.761988e+06 NaN 4.446890e+05
5 Max 2.226004e+08 NaN 4.055131e+05
6 Nate blue 4.044192e+05 2.066645e+06 3.437040e+05
7 Jake red;yellow 6.060661e+05 6.098866e+07 2.948494e+06
I need two things:
1)In the first case I need to add all the values (Value_1, Value_2, Value_3) where I have the same name and get for example:
Name Value_1 Value_2 Value_3
0 Jake 1.817299e+06 6.098866e+07 4.123794e+06
1 Nate 4.411909e+06 3.576553e+06 3.437040e+05
2 Alex NaN 4.348659e+04 NaN
3 Max 2.226004e+08 NaN 4.055131e+05
2)I need the same thing but with the values of the name column plus the splits of the color column (only if there is at least one name and one color in the same row):
Name Color Value_1 Value_2 Value_3
0 Alex green NaN 4.348659e+04 NaN
1 Alex blue NaN 4.348659e+04 NaN
3 Alex red NaN 4.348659e+04 NaN
4 Jake red 6.060661e+05 6.098866e+07 2.948494e+06
5 Jake yellow 6.060661e+05 6.098866e+07 2.948494e+06
6 Nate red 4.007490e+06 1.509907e+06 NaN
7 Nate blue 4.411909e+06 3.576553e+06 3.437040e+05
(Note that in this case the only line present twice is Nate-Blue)
[Edit]
I apologize but I had not considered a further case and I am unable to resolve it:
For point 2: in all cases where I have several times the same color for the same name, separated by semicolon as in the example:
Name color Value_1 Value_2 Value_2
Max red;red 1 1 1
Jake b;b;b;y;y 1 1 1
Max red 3 3 3
I will receive something like:
Name color Value_1 Value_2 Value_2
Max red 5 5 5
Jake b 3 3 3
Jake y 2 2 2
Because it adds up each value for each color associated with that name
But I would like color repeats in the same row for the same name to be counted only once:
Name color Value_1 Value_2 Value_2
Max red 4 4 4
Jake b 1 1 1
Jake b 1 1 1
Answers:
df1['Color'] = df1['Color'].apply(lambda x: x.split(';'))
df1.explode('Color')
First replace empty strings in first 2 columns to mising values:
df1[['Name','Color']] = df1[['Name','Color']].replace('', np.nan)
Then aggregate sum
with min_count=1
for missing values instead 0
:
df2 = df1.groupby('Name', as_index=False).sum(min_count=1)
print (df2)
Name Value_1 Value_2 Value_3
0 Alex NaN 4.348659e+04 NaN
1 Jake 1.817299e+06 6.098866e+07 4.123794e+06
2 Max 2.226004e+08 NaN 4.055131e+05
3 Nate 4.411909e+06 3.576553e+06 3.437040e+05
For second ouput first use Series.str.split
with DataFrame.explode
and then aggregate sum
:
df3 = (df1.assign(Color=df1['Color'].str.split(';'))
.explode('Color')
.groupby(['Name', 'Color'], as_index=False)
.sum(min_count=1))
print (df3)
Name Color Value_1 Value_2 Value_3
0 Alex blue NaN 4.348659e+04 NaN
1 Alex green NaN 4.348659e+04 NaN
2 Alex red NaN 4.348659e+04 NaN
3 Jake red 6.060661e+05 6.098866e+07 2.948494e+06
4 Jake yellow 6.060661e+05 6.098866e+07 2.948494e+06
5 Nate blue 4.411909e+06 3.576553e+06 3.437040e+05
6 Nate red 4.007490e+06 1.509907e+06 NaN
EDIT: You can remove duplicates by all rows by DataFrame.drop_duplicates
:
df3 = (df1.assign(color=df1['color'].str.split(';'))
.explode('color')
.drop_duplicates()
.groupby(['Name', 'color'], as_index=False, sort=False)
.sum(min_count=1)
)
print (df3)
Name color Value_1 Value_2 Value_3
0 Max red 4 4 4
1 Jake b 1 1 1
2 Jake y 1 1 1
You can use:
(df1.assign(Color=df1['Color'].str.split(';'))
.explode('Color')
.groupby(['Name', 'Color'], as_index=False)
.sum()
.replace('', pd.NA).dropna()
)
output:
Name Color Value_1 Value_2 Value_3
3 Alex blue 0.000000e+00 4.348659e+04 0.000000e+00
4 Alex green 0.000000e+00 4.348659e+04 0.000000e+00
5 Alex red 0.000000e+00 4.348659e+04 0.000000e+00
7 Jake red 6.060661e+05 6.098866e+07 2.948494e+06
8 Jake yellow 6.060661e+05 6.098866e+07 2.948494e+06
10 Nate blue 4.411909e+06 3.576553e+06 3.437040e+05
11 Nate red 4.007490e+06 1.509907e+06 0.000000e+00
I have the following dataframe:
import pandas as pd
import numpy as np
df1 = pd.DataFrame({'Name' : ['Jake', 'Nate', '', 'Alex', '', 'Max', 'Nate', 'Jake'],
'Color' : ['', 'red;blue', 'blue;pink', 'green;blue;red', '', '', 'blue', 'red;yellow'],
'Value_1' : [1211233.419, 4007489.726, 953474.6894, np.NaN, 1761987.704, 222600361, 404419.2243, 606066.067 ],
'Value_2' : [np.NaN, 1509907.457, 4792269.911, 43486.59312, np.NaN, np.NaN, 2066645.251, 60988660.37],
'Value_3' : [1175299.998, np.NaN, 1888559.459, np.NaN, 444689.0177, 405513.0572, 343704.0269, 2948494.383]})
---
Name Color Value_1 Value_2 Value_3
0 Jake 1.211233e+06 NaN 1.175300e+06
1 Nate red;blue 4.007490e+06 1.509907e+06 NaN
2 blue;pink 9.534747e+05 4.792270e+06 1.888559e+06
3 Alex green;blue;red NaN 4.348659e+04 NaN
4 1.761988e+06 NaN 4.446890e+05
5 Max 2.226004e+08 NaN 4.055131e+05
6 Nate blue 4.044192e+05 2.066645e+06 3.437040e+05
7 Jake red;yellow 6.060661e+05 6.098866e+07 2.948494e+06
I need two things:
1)In the first case I need to add all the values (Value_1, Value_2, Value_3) where I have the same name and get for example:
Name Value_1 Value_2 Value_3
0 Jake 1.817299e+06 6.098866e+07 4.123794e+06
1 Nate 4.411909e+06 3.576553e+06 3.437040e+05
2 Alex NaN 4.348659e+04 NaN
3 Max 2.226004e+08 NaN 4.055131e+05
2)I need the same thing but with the values of the name column plus the splits of the color column (only if there is at least one name and one color in the same row):
Name Color Value_1 Value_2 Value_3
0 Alex green NaN 4.348659e+04 NaN
1 Alex blue NaN 4.348659e+04 NaN
3 Alex red NaN 4.348659e+04 NaN
4 Jake red 6.060661e+05 6.098866e+07 2.948494e+06
5 Jake yellow 6.060661e+05 6.098866e+07 2.948494e+06
6 Nate red 4.007490e+06 1.509907e+06 NaN
7 Nate blue 4.411909e+06 3.576553e+06 3.437040e+05
(Note that in this case the only line present twice is Nate-Blue)
[Edit]
I apologize but I had not considered a further case and I am unable to resolve it:
For point 2: in all cases where I have several times the same color for the same name, separated by semicolon as in the example:
Name color Value_1 Value_2 Value_2
Max red;red 1 1 1
Jake b;b;b;y;y 1 1 1
Max red 3 3 3
I will receive something like:
Name color Value_1 Value_2 Value_2
Max red 5 5 5
Jake b 3 3 3
Jake y 2 2 2
Because it adds up each value for each color associated with that name
But I would like color repeats in the same row for the same name to be counted only once:
Name color Value_1 Value_2 Value_2
Max red 4 4 4
Jake b 1 1 1
Jake b 1 1 1
df1['Color'] = df1['Color'].apply(lambda x: x.split(';'))
df1.explode('Color')
First replace empty strings in first 2 columns to mising values:
df1[['Name','Color']] = df1[['Name','Color']].replace('', np.nan)
Then aggregate sum
with min_count=1
for missing values instead 0
:
df2 = df1.groupby('Name', as_index=False).sum(min_count=1)
print (df2)
Name Value_1 Value_2 Value_3
0 Alex NaN 4.348659e+04 NaN
1 Jake 1.817299e+06 6.098866e+07 4.123794e+06
2 Max 2.226004e+08 NaN 4.055131e+05
3 Nate 4.411909e+06 3.576553e+06 3.437040e+05
For second ouput first use Series.str.split
with DataFrame.explode
and then aggregate sum
:
df3 = (df1.assign(Color=df1['Color'].str.split(';'))
.explode('Color')
.groupby(['Name', 'Color'], as_index=False)
.sum(min_count=1))
print (df3)
Name Color Value_1 Value_2 Value_3
0 Alex blue NaN 4.348659e+04 NaN
1 Alex green NaN 4.348659e+04 NaN
2 Alex red NaN 4.348659e+04 NaN
3 Jake red 6.060661e+05 6.098866e+07 2.948494e+06
4 Jake yellow 6.060661e+05 6.098866e+07 2.948494e+06
5 Nate blue 4.411909e+06 3.576553e+06 3.437040e+05
6 Nate red 4.007490e+06 1.509907e+06 NaN
EDIT: You can remove duplicates by all rows by DataFrame.drop_duplicates
:
df3 = (df1.assign(color=df1['color'].str.split(';'))
.explode('color')
.drop_duplicates()
.groupby(['Name', 'color'], as_index=False, sort=False)
.sum(min_count=1)
)
print (df3)
Name color Value_1 Value_2 Value_3
0 Max red 4 4 4
1 Jake b 1 1 1
2 Jake y 1 1 1
You can use:
(df1.assign(Color=df1['Color'].str.split(';'))
.explode('Color')
.groupby(['Name', 'Color'], as_index=False)
.sum()
.replace('', pd.NA).dropna()
)
output:
Name Color Value_1 Value_2 Value_3
3 Alex blue 0.000000e+00 4.348659e+04 0.000000e+00
4 Alex green 0.000000e+00 4.348659e+04 0.000000e+00
5 Alex red 0.000000e+00 4.348659e+04 0.000000e+00
7 Jake red 6.060661e+05 6.098866e+07 2.948494e+06
8 Jake yellow 6.060661e+05 6.098866e+07 2.948494e+06
10 Nate blue 4.411909e+06 3.576553e+06 3.437040e+05
11 Nate red 4.007490e+06 1.509907e+06 0.000000e+00