New Column based on aggregation from different df with specific conditions

Question:

I have two data frames:

  • df1 includes rows with a date
  • df2 includes rows with type and date

I would like to create a column "b" in df1 that, for each row, holds the list of all types (including duplicates) from df2 whose df2.date is less than that row's df1.date.

Example:

  • df1 has a row with date 2023-01-01
  • df2 has three rows:
    • one of type AAA with date 2022-01-01
    • second of type AAA with date 2022-01-02
    • third of type BBB with date 2023-02-02
  • result is ['AAA','AAA']
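
Expressed as code, the example above would look roughly like this (a minimal sketch of just the frames described in the question):

import pandas as pd

# sketch of the frames described above
df1 = pd.DataFrame({'date': pd.to_datetime(['2023-01-01'])})
df2 = pd.DataFrame({'type': ['AAA', 'AAA', 'BBB'],
                    'date': pd.to_datetime(['2022-01-01', '2022-01-02', '2023-02-02'])})
# desired result for df1's single row: ['AAA', 'AAA']  (BBB is dated after 2023-01-01)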
Asked By: vertigo


Answers:

First, let’s create a minimal reproducible example:

import numpy as np
import pandas as pd
from string import ascii_uppercase

def gen(n):
    t0 = pd.Timestamp('2000')
    t1 = (t0 + pd.to_timedelta(np.random.randint(0, 6000, n), unit='D'))
    t2 = (t0 + pd.to_timedelta(np.random.randint(0, 6000, n), unit='D'))
    type_ = [x*3 for x in np.random.choice(list(ascii_uppercase), n)]
    df1 = pd.DataFrame({'date': t1})
    df2 = pd.DataFrame({'date': t2, 'type': type_})
    return df1, df2

Example:

np.random.seed(0)  # reproducible example
df1, df2 = gen(8)

>>> df1
        date
0 2007-06-25
1 2007-02-20
2 2004-07-11
3 2008-12-08
4 2013-07-02
5 2013-04-21
6 2015-12-15
7 2002-10-30

>>> df2
        date type
0 2011-12-22  YYY
1 2016-01-31  RRR
2 2016-03-21  FFF
3 2009-06-30  ZZZ
4 2001-12-06  NNN
5 2007-02-12  III
6 2005-11-05  JJJ
7 2006-01-31  UUU

Then, we concatenate the two dfs, sort by date, and compute a "cumulative list" of 'type' (dropping the NaNs that come from df1). Also note the added index level 'k', which lets us quickly retrieve the rows that came from df1.

def as_list(s):
    # NaN is the only value for which s != s: df1 rows (type is NaN) become [], df2 rows become [s]
    return [s] if s == s else []

z = pd.concat([df1, df2], keys=[0, 1], names=['k']).sort_values(['date', 'k'])
cumlist = z['type'].apply(as_list).cumsum()
newdf1 = df1.assign(b=cumlist.loc[0])

>>> newdf1
        date                               b
0 2007-06-25            [NNN, JJJ, UUU, III]
1 2007-02-20            [NNN, JJJ, UUU, III]
2 2004-07-11                           [NNN]
3 2008-12-08            [NNN, JJJ, UUU, III]
4 2013-07-02  [NNN, JJJ, UUU, III, ZZZ, YYY]
5 2013-04-21  [NNN, JJJ, UUU, III, ZZZ, YYY]
6 2015-12-15  [NNN, JJJ, UUU, III, ZZZ, YYY]
7 2002-10-30                           [NNN]
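
As a sanity check, the same lists can be reproduced with a straightforward per-row filter (a quadratic sketch for verification only, not the method used above):

# brute-force cross-check: for each df1 date, collect all strictly earlier df2 types
# (df2 sorted by date so the lists come out in the same order as the cumulative sum)
d2 = df2.sort_values('date')
check = df1.assign(b=[d2.loc[d2['date'] < d, 'type'].tolist() for d in df1['date']])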

Explanation

Let’s look at the content of z above:

>>> z
          date type
k                  
1 4 2001-12-06  NNN
0 7 2002-10-30  NaN
  2 2004-07-11  NaN
1 6 2005-11-05  JJJ
  7 2006-01-31  UUU
..         ...  ...
0 5 2013-04-21  NaN
  4 2013-07-02  NaN
  6 2015-12-15  NaN
1 1 2016-01-31  RRR
  2 2016-03-21  FFF

It contains the type from df2 and NaN for the df1 rows. The added level-0 index k is 0 for df1 rows and 1 for df2 rows. The frame is sorted by date, then by k to break ties when a date appears in both dfs: rows from df1 come first, since we only want types from df2 with dates strictly before the df1 dates.
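
To illustrate the tie-breaking, here is a small hypothetical case (the frames a and b exist only for this sketch) where the same date appears in both inputs:

# hypothetical tie: the same date in both frames
a = pd.DataFrame({'date': [pd.Timestamp('2010-01-01')]})                   # plays the role of df1
b = pd.DataFrame({'date': [pd.Timestamp('2010-01-01')], 'type': ['XXX']})  # plays the role of df2
t = pd.concat([a, b], keys=[0, 1], names=['k']).sort_values(['date', 'k'])
# the k=0 (df1-like) row sorts first, so its cumulative list would not yet
# contain 'XXX': types are collected for dates strictly before, not on, the date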

Then, let’s look at cumlist:

>>> cumlist
k   
1  4                                       [NNN]
0  7                                       [NNN]
   2                                       [NNN]
1  6                                  [NNN, JJJ]
   7                             [NNN, JJJ, UUU]
                          ...                   
0  5              [NNN, JJJ, UUU, III, ZZZ, YYY]
   4              [NNN, JJJ, UUU, III, ZZZ, YYY]
   6              [NNN, JJJ, UUU, III, ZZZ, YYY]
1  1         [NNN, JJJ, UUU, III, ZZZ, YYY, RRR]
   2    [NNN, JJJ, UUU, III, ZZZ, YYY, RRR, FFF]

For k==0, these are exactly the lists we want: cumlist.loc[0] selects that cross-section, which is indexed by df1's original index, so df1.assign(b=cumlist.loc[0]) aligns each list back onto its row.

Addendum: a new 'ID' column

Say both DataFrames also have an ID column, and we want to perform the same operation as above, but separately for each ID.

The solution becomes:

z = pd.concat([df1, df2], keys=[0, 1], names=['k']).sort_values(['ID', 'date', 'k'])
cumlist = z.assign(
    type=z['type'].apply(as_list)
).groupby('ID', group_keys=False)['type'].apply(pd.Series.cumsum)
newdf1 = df1.assign(b=cumlist.loc[0])
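
Note that group_keys=False keeps the original (k, index) MultiIndex on the result of the groupby, so cumlist.loc[0] again selects the rows that came from df1.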

Example:

def gen(n, n_id=4):
    t0 = pd.Timestamp('2000')
    t1 = (t0 + pd.to_timedelta(np.random.randint(0, 6000, n), unit='D'))
    t2 = (t0 + pd.to_timedelta(np.random.randint(0, 6000, n), unit='D'))
    type_ = [x*3 for x in np.random.choice(list(ascii_uppercase), n)]
    ids = np.repeat(np.arange(n_id), n // n_id + 1)[:n]
    df1 = pd.DataFrame({'date': t1, 'ID': ids})
    df2 = pd.DataFrame({'date': t2, 'ID': ids, 'type': type_})
    return df1, df2

np.random.seed(0)  # reproducible example
df1, df2 = gen(8, 2)

# code above to get newdf1

>>> newdf1
        date  ID                b
0 2007-06-25   0            [NNN]
1 2007-02-20   0            [NNN]
2 2004-07-11   0            [NNN]
3 2008-12-08   0            [NNN]
4 2013-07-02   0  [NNN, ZZZ, YYY]
5 2013-04-21   1  [JJJ, UUU, III]
6 2015-12-15   1  [JJJ, UUU, III]
7 2002-10-30   1               []
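
The per-row cross-check from before extends naturally to the ID case (again only a sketch for verification):

# brute-force cross-check, now restricted to matching IDs
d2 = df2.sort_values('date')
check = df1.assign(b=[
    d2.loc[(d2['ID'] == i) & (d2['date'] < d), 'type'].tolist()
    for i, d in zip(df1['ID'], df1['date'])
])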
Answered By: Pierre D