Python dataframe groupby multiple columns with conditional sum
Question:
I have a df that looks like this:
col1  col2  now         previous  target
A     1     1-1-2015    4-1-2014  0.2
B     0     2-1-2015    2-5-2014  0.33
A     0     3-1-2013    3-9-2011  0.1
A     1     1-1-2014    4-9-2011  1.7
A     1     31-12-2014  4-9-2014  1.9
I am grouping the df by col1 and col2, and for each member of each group I want to sum the target values of the other group members whose now date is earlier than the current member's previous date.
For example, for:
col1  col2  now       previous  target
A     1     1-1-2015  4-1-2014  0.2
I want to sum the target values of:
col1  col2  now       previous  target
A     0     3-1-2013  3-9-2011  0.1
A     1     1-1-2014  4-9-2011  1.7
to eventually have:
col1  col2  now       previous  target  sum
A     1     1-1-2015  4-1-2014  0.2     1.8
Answers:
Interesting problem; I've got something that I think may work. It is slow, though: the nested pairwise loop is O(n**2) in the size of each group, so in the worst case (a single group holding every row) it is quadratic over the whole frame.
Setup data
import pandas as pd
import numpy as np
import io
datastring = io.StringIO(
"""
col1 col2 now previous target
A 1 1-1-2015 4-1-2014 0.2
B 0 2-1-2015 2-5-2014 0.33
A 0 3-1-2013 3-9-2011 0.1
A 1 1-1-2014 4-9-2011 1.7
A 1 31-12-2014 4-9-2014 1.9
C 1 31-12-2014 4-9-2014 1.9
""")
# arguments for pandas.read_csv
kwargs = {
    "sep": r"\s+",          # specifies that it's a whitespace separated file
    "parse_dates": [2, 3],  # parse "now" and "previous" as dates
}
# read the csv into a pandas dataframe
df = pd.read_csv(datastring, **kwargs)
Pseudo code for algorithm
For each row:
    For each *other* row:
        If "now" of *other* row comes before "previous" of row,
        then add the *other* row's "target" to the "sum" of row
Run the algorithm
First start by setting up a function f() that is to be applied over all the groups computed by df.groupby(["col1","col2"]). All that f() does is implement the pseudo code above.
def f(df):
    _sum = np.zeros(len(df))
    # represent the desired columns of the sub-dataframe as a numpy array
    data = df[["now", "previous", "target"]].values
    # loop through the rows in the sub-dataframe, df
    for i, outer_row in enumerate(data):
        # for each row, loop through all the rows again
        for j, inner_row in enumerate(data):
            # skip the iteration where the outer loop row is the inner loop row
            if i == j:
                continue
            # get the dates from the rows
            outer_prev = outer_row[1]
            inner_now = inner_row[0]
            # if the "previous" datetime of the outer row is greater than
            # the "now" datetime of the inner row, then add the inner row's
            # "target" to the cumulative sum
            if outer_prev > inner_now:
                _sum[i] += inner_row[2]
    # add a new column for the "sum" that we calculated
    df["sum"] = _sum
    return df
Now just apply f() over the grouped data.
done = df.groupby(["col1","col2"]).apply(f)
Output
  col1  col2        now   previous  target  sum
0    A     1 2015-01-01 2014-04-01    0.20  1.7
1    B     0 2015-02-01 2014-02-05    0.33  0.0
2    A     0 2013-03-01 2011-03-09    0.10  0.0
3    A     1 2014-01-01 2011-04-09    1.70  0.0
4    A     1 2014-12-31 2014-04-09    1.90  1.7
5    C     1 2014-12-31 2014-04-09    1.90  0.0
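For larger groups, the Python-level double loop can be replaced with a NumPy broadcast comparison that builds the whole pairwise mask at once. A minimal sketch of that idea (the f_vec name and the hand-built frame with pre-parsed dates are just for illustration; it follows the same grouping as above):

```python
import numpy as np
import pandas as pd

# same data as above, with the dates already parsed
df = pd.DataFrame({
    "col1": ["A", "B", "A", "A", "A", "C"],
    "col2": [1, 0, 0, 1, 1, 1],
    "now": pd.to_datetime(["2015-01-01", "2015-02-01", "2013-03-01",
                           "2014-01-01", "2014-12-31", "2014-12-31"]),
    "previous": pd.to_datetime(["2014-04-01", "2014-02-05", "2011-03-09",
                                "2011-04-09", "2014-04-09", "2014-04-09"]),
    "target": [0.2, 0.33, 0.1, 1.7, 1.9, 1.9],
})

def f_vec(g):
    # previous[i] > now[j] for every ordered pair (i, j) in the group
    mask = g["previous"].values[:, None] > g["now"].values[None, :]
    # a row must not contribute to its own sum
    np.fill_diagonal(mask, False)
    # row-wise sum of the targets selected by the mask
    return g.assign(sum=np.where(mask, g["target"].values[None, :], 0.0).sum(axis=1))

done = df.groupby(["col1", "col2"], group_keys=False).apply(f_vec)
```

This still does O(n**2) work per group, but as one vectorized boolean-matrix operation instead of nested Python loops; the result matches the output above (sum 1.7 for rows 0 and 4, 0.0 elsewhere).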