New Column based on aggregation from different df with specific conditions
Question:
I have two data frames:
- df1 includes rows with a date
- df2 includes rows with type and date
I would like to create column "b" in df1 that is a list of all types (including duplicates) of df2 with df2.date less than df1.date.
Example:
df1 has a row with date 2023-01-01
df2 has three rows:
- one of type AAA with date 2022-01-01
- second of type AAA with date 2022-01-02
- third of type BBB with date 2023-02-02
The result is ['AAA','AAA'].
Answers:
First, let’s create a minimal reproducible example:
import numpy as np
import pandas as pd
from string import ascii_uppercase

def gen(n):
    # n random dates within 6000 days after 2000-01-01, for each frame
    t0 = pd.Timestamp('2000')
    t1 = (t0 + pd.to_timedelta(np.random.randint(0, 6000, n), unit='D'))
    t2 = (t0 + pd.to_timedelta(np.random.randint(0, 6000, n), unit='D'))
    # random three-letter types such as 'AAA'
    type_ = [x*3 for x in np.random.choice(list(ascii_uppercase), n)]
    df1 = pd.DataFrame({'date': t1})
    df2 = pd.DataFrame({'date': t2, 'type': type_})
    return df1, df2
Example:
np.random.seed(0) # reproducible example
df1, df2 = gen(8)
>>> df1
        date
0 2007-06-25
1 2007-02-20
2 2004-07-11
3 2008-12-08
4 2013-07-02
5 2013-04-21
6 2015-12-15
7 2002-10-30
>>> df2
        date type
0 2011-12-22  YYY
1 2016-01-31  RRR
2 2016-03-21  FFF
3 2009-06-30  ZZZ
4 2001-12-06  NNN
5 2007-02-12  III
6 2005-11-05  JJJ
7 2006-01-31  UUU
Then, we concatenate the two DataFrames, sort by date, and calculate a "cumulative list" of 'type' (the NaN values that come from the df1 rows contribute an empty list). Also, note the added index level 'k', which lets us quickly retrieve the rows that came from df1.
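def as_list(s):
    # NaN is the only value not equal to itself: empty list for NaN
    return [s] if s == s else []

# k=0 tags rows from df1, k=1 tags rows from df2
z = pd.concat([df1, df2], keys=[0, 1], names=['k']).sort_values(['date', 'k'])
cumlist = z['type'].apply(as_list).cumsum()
# cumlist.loc[0] keeps only the df1 rows, indexed by the original df1 index
newdf1 = df1.assign(b=cumlist.loc[0])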
>>> newdf1
        date                               b
0 2007-06-25            [NNN, JJJ, UUU, III]
1 2007-02-20            [NNN, JJJ, UUU, III]
2 2004-07-11                           [NNN]
3 2008-12-08            [NNN, JJJ, UUU, III]
4 2013-07-02  [NNN, JJJ, UUU, III, ZZZ, YYY]
5 2013-04-21  [NNN, JJJ, UUU, III, ZZZ, YYY]
6 2015-12-15  [NNN, JJJ, UUU, III, ZZZ, YYY]
7 2002-10-30                           [NNN]
Explanation
Let’s look at the content of z above:
>>> z
          date type
k
1 4 2001-12-06  NNN
0 7 2002-10-30  NaN
  2 2004-07-11  NaN
1 6 2005-11-05  JJJ
  7 2006-01-31  UUU
..         ...  ...
0 5 2013-04-21  NaN
  4 2013-07-02  NaN
  6 2015-12-15  NaN
1 1 2016-01-31  RRR
  2 2016-03-21  FFF
It contains the type from df2, and NaN for the rows from df1. Rows from df1 have k == 0 in level 0 of the index, and rows from df2 have k == 1. It is sorted by date, and then by k to break ties when a date is present in both DataFrames: rows from df1 come first, because we want the types from df2 for dates strictly before the dates in df1.
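The tie-breaking rule can be illustrated without pandas. The following is a minimal sketch in plain Python, assuming hypothetical (date, k, type) tuples that are not part of the answer; it only shows that sorting on (date, k) places the df1 row (k = 0) ahead of a df2 row (k = 1) with the same date.

```python
from datetime import date

# Hypothetical rows, not from the answer: (date, k, type).
# k = 0 marks a row from df1 (no type), k = 1 a row from df2.
rows = [
    (date(2023, 1, 1), 1, "BBB"),  # df2 row sharing its date with the df1 row
    (date(2022, 1, 2), 1, "AAA"),
    (date(2023, 1, 1), 0, None),   # df1 row
    (date(2022, 1, 1), 1, "AAA"),
]

# Sort by date, then by k: on equal dates the k = 0 row comes first.
ordered = sorted(rows, key=lambda r: (r[0], r[1]))

for d, k, type_ in ordered:
    print(d, k, type_)

# Positions of the df1 row and of the same-date df2 row after sorting.
pos_df1 = ordered.index((date(2023, 1, 1), 0, None))
pos_tie = ordered.index((date(2023, 1, 1), 1, "BBB"))
```

Because the df1 row is reached before the same-date df2 row, a running list read at the df1 row does not yet include 'BBB', which matches the strict "less than" requirement.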
Then, let’s look at cumlist:
>>> cumlist
k
1 4                                     [NNN]
0 7                                     [NNN]
  2                                     [NNN]
1 6                                [NNN, JJJ]
  7                           [NNN, JJJ, UUU]
                                          ...
0 5            [NNN, JJJ, UUU, III, ZZZ, YYY]
  4            [NNN, JJJ, UUU, III, ZZZ, YYY]
  6            [NNN, JJJ, UUU, III, ZZZ, YYY]
1 1       [NNN, JJJ, UUU, III, ZZZ, YYY, RRR]
  2  [NNN, JJJ, UUU, III, ZZZ, YYY, RRR, FFF]
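For k==0, these are the lists we want: cumlist.loc[0] selects exactly those rows, indexed like df1, so df1.assign(b=...) aligns them back to the original rows.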
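To sanity-check the result on small inputs, one can compare against a direct computation. The helper below is only a sketch, and its name (`types_before`) is hypothetical, not part of the answer: for each date in df1 it scans the df2 rows in date order and keeps the types with a strictly earlier date. The data is the example from the question.

```python
from datetime import date

def types_before(df1_dates, df2_rows):
    """Brute-force reference: for each date d1 in df1_dates, list the types
    of df2_rows (pairs of (date, type)) whose date is strictly before d1."""
    # Stable sort by date only, so same-date rows keep their input order.
    by_date = sorted(df2_rows, key=lambda row: row[0])
    return [[t for d2, t in by_date if d2 < d1] for d1 in df1_dates]

df1_dates = [date(2023, 1, 1)]
df2_rows = [
    (date(2022, 1, 1), "AAA"),
    (date(2022, 1, 2), "AAA"),
    (date(2023, 2, 2), "BBB"),
]

b = types_before(df1_dates, df2_rows)
print(b)  # [['AAA', 'AAA']]
```

This scans all of df2 for every row of df1, so it is meant for checking small cases, not as a replacement for the approach above.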
Addendum: new ‘ID’ columns
Say both DataFrames also have an ID column, and we want to perform the same operation as above, but separately for each ID.
The solution becomes:
z = pd.concat([df1, df2], keys=[0, 1], names=['k']).sort_values(['ID', 'date', 'k'])
cumlist = z.assign(
    type=z['type'].apply(as_list)
).groupby('ID', group_keys=False)['type'].apply(pd.Series.cumsum)
newdf1 = df1.assign(b=cumlist.loc[0])
Example:
def gen(n, n_id=4):
    t0 = pd.Timestamp('2000')
    t1 = (t0 + pd.to_timedelta(np.random.randint(0, 6000, n), unit='D'))
    t2 = (t0 + pd.to_timedelta(np.random.randint(0, 6000, n), unit='D'))
    type_ = [x*3 for x in np.random.choice(list(ascii_uppercase), n)]
    ids = np.repeat(np.arange(n_id), n // n_id + 1)[:n]
    df1 = pd.DataFrame({'date': t1, 'ID': ids})
    df2 = pd.DataFrame({'date': t2, 'ID': ids, 'type': type_})
    return df1, df2
np.random.seed(0) # reproducible example
df1, df2 = gen(8, 2)
# code above to get newdf1
>>> newdf1
        date  ID                b
0 2007-06-25   0            [NNN]
1 2007-02-20   0            [NNN]
2 2004-07-11   0            [NNN]
3 2008-12-08   0            [NNN]
4 2013-07-02   0  [NNN, ZZZ, YYY]
5 2013-04-21   1  [JJJ, UUU, III]
6 2015-12-15   1  [JJJ, UUU, III]
7 2002-10-30   1               []
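The per-ID logic can also be written as a single pass in plain Python, which may help in seeing what the grouped cumulative sum does. This is a sketch under assumptions: the function name `types_before_by_id` and the tuple layout are made up for illustration. The dates and types are copied from the printed example, and the IDs follow the ids array built by gen(8, 2) (first five rows ID 0, last three ID 1).

```python
from collections import defaultdict
from datetime import date

def types_before_by_id(df1_rows, df2_rows):
    """df1_rows: list of (date, ID); df2_rows: list of (date, ID, type).
    Returns, for each df1 row in input order, the types of the df2 rows
    with the same ID and a strictly earlier date, in date order."""
    events = [(d, 0, i, id_, None) for i, (d, id_) in enumerate(df1_rows)]
    events += [(d, 1, j, id_, t) for j, (d, id_, t) in enumerate(df2_rows)]
    # Sort by date, then k (0 = df1 first on ties), then original position.
    events.sort(key=lambda e: (e[0], e[1], e[2]))

    running = defaultdict(list)  # ID -> types seen so far for that ID
    out = [None] * len(df1_rows)
    for d, k, pos, id_, t in events:
        if k == 0:
            out[pos] = list(running[id_])  # snapshot (copy) for this df1 row
        else:
            running[id_].append(t)
    return out

df1_rows = [
    (date(2007, 6, 25), 0), (date(2007, 2, 20), 0), (date(2004, 7, 11), 0),
    (date(2008, 12, 8), 0), (date(2013, 7, 2), 0), (date(2013, 4, 21), 1),
    (date(2015, 12, 15), 1), (date(2002, 10, 30), 1),
]
df2_rows = [
    (date(2011, 12, 22), 0, "YYY"), (date(2016, 1, 31), 0, "RRR"),
    (date(2016, 3, 21), 0, "FFF"), (date(2009, 6, 30), 0, "ZZZ"),
    (date(2001, 12, 6), 0, "NNN"), (date(2007, 2, 12), 1, "III"),
    (date(2005, 11, 5), 1, "JJJ"), (date(2006, 1, 31), 1, "UUU"),
]

b = types_before_by_id(df1_rows, df2_rows)
for (d, id_), types in zip(df1_rows, b):
    print(d, id_, types)
```

Unlike the pandas version, this sorts all rows by date once and keeps one running list per ID, rather than sorting by ID first; the lists per df1 row come out the same as in the table above.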