Python dataframe groupby multiple columns with conditional sum

Question:

I have a df that looks like this:

col1    col2       now        previous      target
 A        1      1-1-2015     4-1-2014       0.2
 B        0      2-1-2015     2-5-2014       0.33
 A        0      3-1-2013     3-9-2011       0.1
 A        1      1-1-2014     4-9-2011       1.7
 A        1      31-12-2014   4-9-2014       1.9

I am grouping the df by col1 and col2, and for each member of each group I want to sum the target values of only those other group members whose now date is earlier than the current member’s previous date.

For example for:

col1    col2       now        previous      target
 A        1      1-1-2015     4-1-2014       0.2

I want to sum the target values of:

col1    col2       now        previous      target
 A        0      3-1-2013     3-9-2011       0.1
 A        1      1-1-2014     4-9-2011       1.7

to eventually have:

col1    col2       now        previous      target    sum
 A        1      1-1-2015     4-1-2014       0.2      1.8
Asked By: Binyamin Even

Answers:

Interesting problem, I’ve got something that I think may work.

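It is slow, though: the nested loops compare every pair of rows within a group, so the worst case, with all n rows in a single group, is O(n**2), and the best case, with every row in its own group, is O(n).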
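
As a rough sanity check on that cost, the number of pairwise comparisons a per-group nested loop makes can be counted directly. This is only a sketch: `keys` is a hand-copied list of the (col1, col2) pairs from the sample data in this answer, and `group_sizes` and `comparisons` are names made up for the illustration.

```python
from collections import Counter

# (col1, col2) keys of the six sample rows used in this answer (hand-copied)
keys = [("A", 1), ("B", 0), ("A", 0), ("A", 1), ("A", 1), ("C", 1)]

# number of rows that fall into each group
group_sizes = Counter(keys)

# each row is compared with every *other* row of its own group,
# so a group of g rows costs g * (g - 1) comparisons
comparisons = sum(g * (g - 1) for g in group_sizes.values())
print(comparisons)
```

Only the three ("A", 1) rows have partners to compare against, so the total is driven by the largest group rather than by the number of rows alone.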

Setup data

import pandas as pd
import numpy as np
import io

datastring = io.StringIO(
"""
col1    col2       now        previous      target
 A        1      1-1-2015     4-1-2014       0.2
 B        0      2-1-2015     2-5-2014       0.33
 A        0      3-1-2013     3-9-2011       0.1
 A        1      1-1-2014     4-9-2011       1.7
 A        1      31-12-2014   4-9-2014       1.9
 C        1      31-12-2014   4-9-2014       1.9
""")
# arguments for pandas.read_csv
kwargs = {
    "sep": r"\s+", # specifies that it's a whitespace-separated file
    "parse_dates": [2,3], # parse "now" and "previous" as dates
    }
# read the csv into a pandas dataframe
df = pd.read_csv(datastring, **kwargs)
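
One thing worth checking at this point is how the dates were interpreted, since a value like 4-1-2014 is ambiguous (the output further down shows it read as 2014-04-01). A minimal sketch, assuming the dates are meant as day-month-year (suggested by 31-12-2014, but not confirmed by the question), is to read the columns as text and convert them with an explicit format. The names `sample` and `check` are made up for this illustration.

```python
import io
import pandas as pd

# two rows copied from the sample data, enough to show the ambiguity
sample = io.StringIO(
"""
col1    col2       now        previous      target
 A        1      1-1-2015     4-1-2014       0.2
 A        1      31-12-2014   4-9-2014       1.9
""")

# read the date columns as plain text first (no parse_dates)
check = pd.read_csv(sample, sep=r"\s+")

# assumption: day-month-year; use "%m-%d-%Y" instead if month comes first
for col in ["now", "previous"]:
    check[col] = pd.to_datetime(check[col], format="%d-%m-%Y")

print(check.dtypes)
print(check)
```

With an explicit format, a value that does not fit raises an error instead of being silently parsed under a different convention.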

Pseudo code for algorithm

For each row in a group:
    For each *other* row in the same group:
        If "now" of the *other* row comes before "previous" of the row
        Then add the *other* row's "target" to the "sum" of the row

Run the algorithm

Start by defining a function f() to be applied to each group produced by df.groupby(["col1","col2"]). f() simply implements the pseudo code above.

def f(df):
    _sum = np.zeros(len(df))
    # represent the desired columns of the sub-dataframe as a numpy object
    data = df[["now","previous","target"]].values
    # loop through the rows in the sub-dataframe, df
    for i, outer_row in enumerate(data):
        # for each row, loop through all the rows again
        for j, inner_row in enumerate(data):
            # skip the row itself, so only *other* rows are summed
            if i == j:
                continue
            # get the dates from rows
            outer_prev = outer_row[1]
            inner_now = inner_row[0]
            # if the "previous" datetime of the outer loop is greater than
            # the "now" datetime of the inner loop, then add "target" to
            # the cumulative sum
            if outer_prev > inner_now:
                _sum[i] += inner_row[2]
    # return a copy of the group with the new "sum" column added
    return df.assign(sum=_sum)

Now just apply f() over the grouped data.

# group_keys=False keeps the original row index in the result
done = df.groupby(["col1","col2"], group_keys=False).apply(f)

Output

  col1  col2        now   previous  target  sum
0    A     1 2015-01-01 2014-04-01    0.20  1.7
1    B     0 2015-02-01 2014-02-05    0.33  0.0
2    A     0 2013-03-01 2011-03-09    0.10  0.0
3    A     1 2014-01-01 2011-04-09    1.70  0.0
4    A     1 2014-12-31 2014-04-09    1.90  1.7
5    C     1 2014-12-31 2014-04-09    1.90  0.0
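
If the groups get large, the inner Python loops can probably be replaced with a NumPy broadcast comparison. The following is a sketch of that idea, not part of the code above: `pairwise_sum`, `demo` and `parts` are hypothetical names, and the dates are copied from the output table as already parsed. It still builds a g-by-g mask per group, so memory is quadratic in the group size.

```python
import numpy as np
import pandas as pd

def pairwise_sum(group):
    # hypothetical helper applying the same rule as the nested loops in f()
    now = group["now"].to_numpy()
    prev = group["previous"].to_numpy()
    target = group["target"].to_numpy()
    # mask[i, j] is True when row j's "now" is before row i's "previous"
    mask = prev[:, None] > now[None, :]
    # a row never counts its own target
    np.fill_diagonal(mask, False)
    # row i of the product adds up the targets selected by mask[i]
    return pd.Series(mask @ target, index=group.index)

# the five rows from the output table, with dates as shown there
demo = pd.DataFrame({
    "col1": ["A", "B", "A", "A", "A"],
    "col2": [1, 0, 0, 1, 1],
    "now": pd.to_datetime(["2015-01-01", "2015-02-01", "2013-03-01",
                           "2014-01-01", "2014-12-31"]),
    "previous": pd.to_datetime(["2014-04-01", "2014-02-05", "2011-03-09",
                                "2011-04-09", "2014-04-09"]),
    "target": [0.2, 0.33, 0.1, 1.7, 1.9],
})

# compute one Series per group, then let pandas align them on the row index
parts = [pairwise_sum(g) for _, g in demo.groupby(["col1", "col2"])]
demo["sum"] = pd.concat(parts)
print(demo)
```

Iterating over the groups and concatenating the per-group Series avoids relying on how groupby().apply() shapes its result, which has changed between pandas versions.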
Answered By: Filip Kilibarda