Python pandas two table match to find latest date
Question:
I want to do a lookup in pandas similar to VLOOKUP in Excel: for each Name and threshold in Table 1, find the latest date in Table 2 on which Value equals that threshold.
Table 1:
Name Threshold1 Threshold2
A 9 8
B 14 13
Table 2:
Date Name Value
1/1 A 10
1/2 A 9
1/3 A 9
1/4 A 8
1/5 A 8
1/1 B 15
1/2 B 14
1/3 B 14
1/4 B 13
1/5 B 13
The desired table is like:
Name Threshold1 Threshold1_Date Threshold2 Threshold2_Date
A 9 1/3 8 1/5
B 14 1/3 13 1/5
Thanks in advance!
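For anyone who wants to reproduce this, the two tables can be built roughly as follows. The names df1 and df2 are the ones the answers use; storing Date as plain strings is an assumption, since the question does not say how the dates are typed.

```python
import pandas as pd

# Table 1: one row per Name, one column per threshold
df1 = pd.DataFrame({
    "Name": ["A", "B"],
    "Threshold1": [9, 14],
    "Threshold2": [8, 13],
})

# Table 2: dated observations per Name
# (assumption: dates are kept as strings such as "1/3")
df2 = pd.DataFrame({
    "Date": ["1/1", "1/2", "1/3", "1/4", "1/5"] * 2,
    "Name": ["A"] * 5 + ["B"] * 5,
    "Value": [10, 9, 9, 8, 8, 15, 14, 14, 13, 13],
})
```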
Answers:
Does this work?
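df_out = (df1.melt('Name', value_name='Value')           # long format: one row per (Name, threshold)
             .merge(df2, on=['Name', 'Value'])           # df2 rows whose Value equals a threshold
             .sort_values('Date')
             .drop_duplicates(['Name', 'variable'], keep='last')  # latest date per (Name, threshold)
             .set_index(['Name', 'variable'])
             .unstack()
             .sort_index(level=1, axis=1))
# flatten the two-level columns, e.g. ('Date', 'Threshold1') -> 'Date_Threshold1'
df_out = df_out.set_axis(df_out.columns.map('_'.join), axis=1).reset_index()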
Output:
Name Date_Threshold1 Value_Threshold1 Date_Threshold2 Value_Threshold2
0 A 1/3 9 1/5 8
1 B 1/3 14 1/5 13
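The same melt/merge idea can be unpacked into separate steps, which also makes it easy to get the column names asked for in the question. This is only a sketch: the intermediate names (long, matched, latest, wide, result) are made up for illustration, and sorting the string dates only works here because every day is a single digit.

```python
import pandas as pd

df1 = pd.DataFrame({"Name": ["A", "B"],
                    "Threshold1": [9, 14],
                    "Threshold2": [8, 13]})
df2 = pd.DataFrame({"Date": ["1/1", "1/2", "1/3", "1/4", "1/5"] * 2,
                    "Name": ["A"] * 5 + ["B"] * 5,
                    "Value": [10, 9, 9, 8, 8, 15, 14, 14, 13, 13]})

# 1. long format: one row per (Name, threshold column)
long = df1.melt("Name", var_name="variable", value_name="Value")

# 2. keep only the df2 rows whose Value equals a threshold for that Name
matched = long.merge(df2, on=["Name", "Value"])

# 3. latest matching row per (Name, threshold column)
latest = (matched.sort_values("Date")
                 .drop_duplicates(["Name", "variable"], keep="last"))

# 4. back to wide format, one date column per threshold, then attach to df1
wide = (latest.pivot(index="Name", columns="variable", values="Date")
              .add_suffix("_Date"))
result = df1.merge(wide.reset_index(), on="Name")
```

result keeps the threshold columns from df1 and adds Threshold1_Date and Threshold2_Date next to them.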
Code
# assumes df2 is already sorted by `Date`
# drop the duplicates per (Name, Value), keeping the last (latest) date
cols = ['Name', 'Value']
s = df2.drop_duplicates(cols, keep='last').set_index(cols)['Date']
# for each threshold column, use MultiIndex.map to look up the date
# in `s` by matching (Name, threshold value)
for c in df1.filter(like='Threshold'):
    df1[c + '_date'] = df1.set_index(['Name', c]).index.map(s)
Result
Name Threshold1 Threshold2 Threshold1_date Threshold2_date
0 A 9 8 1/3 1/5
1 B 14 13 1/3 1/5
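If df2 is not guaranteed to be sorted, one option is to sort it explicitly first. The sketch below assumes the Date strings are month/day with no year (that format is a guess, not something the question states), parses them only for ordering, and writes to a copy instead of modifying df1. The helper column _parsed and the name out are hypothetical.

```python
import pandas as pd

df1 = pd.DataFrame({"Name": ["A", "B"],
                    "Threshold1": [9, 14],
                    "Threshold2": [8, 13]})
df2 = pd.DataFrame({"Date": ["1/1", "1/2", "1/3", "1/4", "1/5"] * 2,
                    "Name": ["A"] * 5 + ["B"] * 5,
                    "Value": [10, 9, 9, 8, 8, 15, 14, 14, 13, 13]})

# shuffle the rows to simulate an unsorted table
df2 = df2.sample(frac=1, random_state=0)

# ASSUMPTION: "Date" is month/day without a year; parse it only to get a sort key
parsed = pd.to_datetime(df2["Date"], format="%m/%d")
df2_sorted = df2.assign(_parsed=parsed).sort_values("_parsed")

# latest date per (Name, Value), keeping the original date strings
cols = ["Name", "Value"]
s = df2_sorted.drop_duplicates(cols, keep="last").set_index(cols)["Date"]

# same MultiIndex.map lookup as above, but on a copy of df1
out = df1.copy()
for c in df1.filter(like="Threshold"):
    out[c + "_Date"] = df1.set_index(["Name", c]).index.map(s)
```

Parsing also avoids the string-ordering trap where "1/10" sorts before "1/2".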
Here’s a way to do what your question asks:
latestDtByNameVal = df2.groupby(['Name','Value']).last()
res = df1.assign(**(
    df1.set_index('Name').pipe(lambda d: {
        f'{col}_Date': d[[col]].rename(columns={col: 'Value'})
                               .set_index('Value', append=True)
                               .pipe(lambda d: latestDtByNameVal.Date[d.index].to_numpy())
        for col in d.columns})
))
If you want the result columns to be ordered as in your question, you can add one of the following:
# use numpy ravel (order='F' interleaves each threshold with its date column):
import numpy as np
res = res[['Name'] + list(np.ravel([[x + s for x in df1.columns if x != 'Name'] for s in ['','_Date']], order='F'))]
# ... or, use itertools:
from itertools import chain
res = res[['Name'] + list(chain.from_iterable([[col, f'{col}_Date'] for col in df1.drop(columns='Name').columns]))]
Output:
Name Threshold1 Threshold1_Date Threshold2 Threshold2_Date
0 A 9 1/3 8 1/5
1 B 14 1/3 13 1/5
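A loop-based variant of the same groupby lookup may be easier to read, and using reindex leaves a missing value where a threshold never occurs in df2. This is a sketch, not a drop-in replacement: the names latest, keys and res are illustrative, the threshold 99 is a made-up value added only to show the no-match case, and groupby(...).last() still relies on df2 being in date order.

```python
import pandas as pd

# 99 is a made-up threshold that never appears for B in df2
df1 = pd.DataFrame({"Name": ["A", "B"],
                    "Threshold1": [9, 14],
                    "Threshold2": [8, 99]})
df2 = pd.DataFrame({"Date": ["1/1", "1/2", "1/3", "1/4", "1/5"] * 2,
                    "Name": ["A"] * 5 + ["B"] * 5,
                    "Value": [10, 9, 9, 8, 8, 15, 14, 14, 13, 13]})

# last Date per (Name, Value); assumes df2 rows are in date order
latest = df2.groupby(["Name", "Value"])["Date"].last()

res = df1.copy()
for col in df1.columns.drop("Name"):
    # (Name, threshold) pairs to look up for this threshold column
    keys = pd.MultiIndex.from_frame(df1[["Name", col]])
    # reindex yields NaN for pairs that are absent from df2
    res[f"{col}_Date"] = latest.reindex(keys).to_numpy()
```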