Query for one dataframe row based on row in another dataframe & compare values
Question:
So I have two data frames. The first data frame contains numerical data that is used to "score" the second data frame which contains simulation data.
df1 = base records
df2 = simulation records
Part 1: What I am trying to accomplish is to query df1 ‘base records’ to find the row that has the most recent timestamp to that in the df2 ‘simulation records’ where the "Name" & "Time" columns match exactly.
Part 2: Then I want to use an if then function to determine whether a value in the simulation record row fall between a range created using two values from the base record row and return a boolean.
low range = df1[‘Po’]-df1[‘Ref’]
high range = df1[‘Po’]+df1[‘Ref’]
if df2[‘Sim’] falls in between the low range & high range of its most recent df1 base record then I want to return true in the new column "Sim Score"
otherwise return false
Part 3: I want to repeat Part 1 & Part 2 for each row in the simulation records.
helpful information:
- df1 (base records) have more or less rows than df2 (simulation records)
- df1 has more columns than df2
- some columns in df1 have the same name but different values in df2
- ideally want to be able to slice both dataframes where the if then function only sees the two rows used in the comparison
- only need the most recent df1 base record to compare to the df2 simulation record
- previously accomplished this in google sheets with if then & query combination formula dragged down entire sheet (want to replace with python & pandas)
df1 base records example (columns that matter)
Timestamp Name Time Po Ref
7/11/2022 11:30:00 trial 20 mins 5 2
7/10/2022 04:00:00 trial 20 mins 4 4
7/09/2022 02:45:00 trial 20 mins 2 2
6/28/2022 03:45:00 trial 20 mins 3 6
df2 simulation records example (columns that matter)
Timestamp Name Time Sim
7/10/2022 05:15:00 trial 20 mins 7
7/11/2022 12:45:00 trial 20 mins 4
7/12/2022 03:30:00 trial 20 mins 8
desired result of new column added to df2
Timestamp Name Time Sim Sim Score
7/10/2022 05:15:00 trial 20 mins 7 True
7/11/2022 12:45:00 trial 20 mins 4 True
7/12/2022 03:30:00 trial 20 mins 8 False
Answers:
Because you don’t provide code to construct the dataframe, I will sketch a potential solution:
First, I will assume that you have only one timestamp per day (which it looks like in your examples). Accordingly, I would truncate or split the timestamp to only have the date in one column. This is done so we can join the dataframes based on the date, i.e. use set_index("date_column")
for both dataframes (use an inner-join to only keep the rows where the date was present in both dataframes). Finally, you can use apply()
to check your condition:
df_joined['Sim Score'] = df_joined.apply(lambda row: (row['Po']-row['Ref'] <= row['Sim']) and (row['Po']+row['Ref'] >= row['Sim']), axis = 1)
Use pandas.DataFrame.reindex
, its method
offer nearest to find the computable index(e.g., string can not calculate distance)
Or use merge_asof
, its direction
offer nearest.
Method 1:
reindex()
with method='nearest'
df1['Timestamp'] = pd.to_datetime(df1['Timestamp'])
df1.set_index('Timestamp', inplace=True)
df1['l_r'] = df1['Po'] - df1['Ref']
df1['h_r'] = df1['Po'] + df1['Ref']
print(df1)
###
Name Time Po Ref l_r h_r
Timestamp
2022-07-11 11:30:00 trial 20 mins 5 2 3 7
2022-07-10 04:00:00 trial 20 mins 4 4 0 8
2022-07-09 02:45:00 trial 20 mins 2 2 0 4
2022-06-28 03:45:00 trial 20 mins 3 6 -3 9
df2['Timestamp'] = pd.to_datetime(df2['Timestamp'])
df2.set_index('Timestamp', inplace=True)
print(df2)
###
Name Time Sim
Timestamp
2022-07-10 05:15:00 trial 20 mins 7
2022-07-11 12:45:00 trial 20 mins 4
2022-07-12 03:30:00 trial 20 mins 8
temp = df2.join(df1.reindex(df2.index, method='nearest'), lsuffix='_left', rsuffix='_right')
print(temp)
As you can see, this is df2.join(df1)
,
join multiple DataFrame objects by index at once.
with method='nearest'
, in this case, it would join df2
and df1
by the nearest Timestamp
index.
df2['Sim Score'] = temp['Sim'].between(temp['l_r'], temp['h_r']).values
df2.reset_index(inplace=True)
print(df2)
###
Timestamp Name Time Sim Sim Score
0 2022-07-10 05:15:00 trial 20 mins 7 True
1 2022-07-11 12:45:00 trial 20 mins 4 True
2 2022-07-12 03:30:00 trial 20 mins 8 False
Method 2:
merge_asof()
with direction='nearest'
this way is not executed with indexed value, therefore we don’t have to set column Timestamp
as index. But it needs binding objects(in this case we merge on column Timestamp
)sorted.
df1['Timestamp'] = pd.to_datetime(df1['Timestamp'])
# df1.set_index('Timestamp', inplace=True)
df1['l_r'] = df1['Po'] - df1['Ref']
df1['h_r'] = df1['Po'] + df1['Ref']
df1.sort_values(by='Timestamp', inplace=True)
print(df1)
###
Timestamp Name Time Po Ref l_r h_r
3 2022-06-28 03:45:00 trial 20 mins 3 6 -3 9
2 2022-07-09 02:45:00 trial 20 mins 2 2 0 4
1 2022-07-10 04:00:00 trial 20 mins 4 4 0 8
0 2022-07-11 11:30:00 trial 20 mins 5 2 3 7
df2['Timestamp'] = pd.to_datetime(df2['Timestamp'])
# df2.set_index('Timestamp', inplace=True)
df2.sort_values(by='Timestamp', inplace=True)
print(df2)
###
Timestamp Name Time Sim
0 2022-07-10 05:15:00 trial 20 mins 7
1 2022-07-11 12:45:00 trial 20 mins 4
2 2022-07-12 03:30:00 trial 20 mins 8
temp = pd.merge_asof(df2 ,df1[['Timestamp', 'l_r', 'h_r']], on='Timestamp', direction='nearest')
print(temp)
As you can see, this is pd.merge_asof(df2, df1)
,
This is similar to a left-join except that we match on nearest key rather than equal keys. Both DataFrames must be sorted by the key.
For each row in the left DataFrame:
A “nearest” search selects the row in the right DataFrame whose ‘on’ key is closest in absolute distance to the left’s key.
df2['Sim Score'] = temp['Sim'].between(temp['l_r'], temp['h_r']).values
print(df2)
###
Timestamp Name Time Sim Sim Score
0 2022-07-10 05:15:00 trial 20 mins 7 True
1 2022-07-11 12:45:00 trial 20 mins 4 True
2 2022-07-12 03:30:00 trial 20 mins 8 False
Frankly speaking, working on indexed things would be faster if you have a large dataset.
Method 2 (on multiple keys)
I remodified df1
adding different Name and Time
df1 = pd.DataFrame({'Timestamp':['7/11/2022 11:30:00','7/11/2022 11:30:00','7/10/2022 04:00:00','7/10/2022 04:00:00','7/09/2022 02:45:00','6/28/2022 03:45:00'],
'Name':['trial','trial','trial','non-trial','trial','trial'],
'Time':['20 mins','30 mins','20 mins','20 mins','20 mins','20 mins'],
'Po':[5, 6, 4, 1, 2, 3],
'Ref':[2, 2, 4, 3, 2, 6]})
df1['Timestamp'] = pd.to_datetime(df1['Timestamp'])
df1['l_r'] = df1['Po'] - df1['Ref']
df1['h_r'] = df1['Po'] + df1['Ref']
df1.sort_values(by='Timestamp', inplace=True)
print(df1)
###
Timestamp Name Time Po Ref l_r h_r
5 2022-06-28 03:45:00 trial 20 mins 3 6 -3 9
4 2022-07-09 02:45:00 trial 20 mins 2 2 0 4
2 2022-07-10 04:00:00 trial 20 mins 4 4 0 8
3 2022-07-10 04:00:00 non-trial 20 mins 1 3 -2 4
0 2022-07-11 11:30:00 trial 20 mins 5 2 3 7
1 2022-07-11 11:30:00 trial 30 mins 6 2 4 8
print(df2)
###
Timestamp Name Time Sim
0 2022-07-10 05:15:00 trial 20 mins 7
1 2022-07-11 12:45:00 trial 20 mins 4
2 2022-07-12 03:30:00 trial 20 mins 8
Important:
can only merge_asof on a single key, therefore others would utilize by=
to deal with.
temp = pd.merge_asof(df2, df1[['Timestamp', 'Name', 'Time', 'l_r', 'h_r']], on='Timestamp', by=['Name','Time'], direction='nearest')
print(temp)
df2['Sim Score'] = temp['Sim'].between(temp['l_r'], temp['h_r']).values
print(df2)
###
Timestamp Name Time Sim Sim Score
0 2022-07-10 05:15:00 trial 20 mins 7 True
1 2022-07-11 12:45:00 trial 20 mins 4 True
2 2022-07-12 03:30:00 trial 20 mins 8 False
Reference:
pandas.DataFrame.join
pandas.merge_asof
merging/join concept
You can do it via pandasql
:
But note that you better add a unique constraint to one of the columns (e.g. a number of trial)
from pandasql import sqldf
df3 = sqldf('''
SELECT df2.Timestamp AS Date, df1.Name, df1.Time, df2.Sim,
CASE
WHEN Sim >= (df1.Po - df1.Ref) AND Sim <= (df1.Po + df1.Ref) THEN 'True'
WHEN Sim < (df1.Po - df1.Ref) OR Sim > (df1.Po + df1.Ref) THEN 'False'
END AS 'Sim Score'
FROM df1, df2
WHERE df2.Name == df1.Name AND df2.Time == df1.Time
ORDER BY Date ASC
''')
Also to work with datetime
format in sqldf
you need to name your Timestamp column as date in the query. If you need to get only let’s say first/earliest 5 results add LIMIT 5
in the end of the query.
If you need to get closest date in df2 to df1 try this:
from pandasql import sqldf
df3 = sqldf('''
SELECT df2.Timestamp AS Date1, df2.Timestamp AS Date2,
df1.Name, df1.Time, df2.Sim,
CASE
WHEN Sim >= (df1.Po - df1.Ref) AND Sim <= (df1.Po + df1.Ref) THEN 'True'
WHEN Sim < (df1.Po - df1.Ref) OR Sim > (df1.Po + df1.Ref) THEN 'False'
END AS 'Sim Score'
FROM df1, df2
WHERE df2.Name == df1.Name AND df2.Time == df1.Time
and Date1 <= Date2
group by Date2
ORDER BY Date1 ASC
''')
So I have two data frames. The first data frame contains numerical data that is used to "score" the second data frame which contains simulation data.
df1 = base records
df2 = simulation records
Part 1: What I am trying to accomplish is to query df1 ‘base records’ to find the row that has the most recent timestamp to that in the df2 ‘simulation records’ where the "Name" & "Time" columns match exactly.
Part 2: Then I want to use an if then function to determine whether a value in the simulation record row fall between a range created using two values from the base record row and return a boolean.
low range = df1[‘Po’]-df1[‘Ref’]
high range = df1[‘Po’]+df1[‘Ref’]
if df2[‘Sim’] falls in between the low range & high range of its most recent df1 base record then I want to return true in the new column "Sim Score"
otherwise return false
Part 3: I want to repeat Part 1 & Part 2 for each row in the simulation records.
helpful information:
- df1 (base records) have more or less rows than df2 (simulation records)
- df1 has more columns than df2
- some columns in df1 have the same name but different values in df2
- ideally want to be able to slice both dataframes where the if then function only sees the two rows used in the comparison
- only need the most recent df1 base record to compare to the df2 simulation record
- previously accomplished this in google sheets with if then & query combination formula dragged down entire sheet (want to replace with python & pandas)
df1 base records example (columns that matter)
Timestamp Name Time Po Ref
7/11/2022 11:30:00 trial 20 mins 5 2
7/10/2022 04:00:00 trial 20 mins 4 4
7/09/2022 02:45:00 trial 20 mins 2 2
6/28/2022 03:45:00 trial 20 mins 3 6
df2 simulation records example (columns that matter)
Timestamp Name Time Sim
7/10/2022 05:15:00 trial 20 mins 7
7/11/2022 12:45:00 trial 20 mins 4
7/12/2022 03:30:00 trial 20 mins 8
desired result of new column added to df2
Timestamp Name Time Sim Sim Score
7/10/2022 05:15:00 trial 20 mins 7 True
7/11/2022 12:45:00 trial 20 mins 4 True
7/12/2022 03:30:00 trial 20 mins 8 False
Because you don’t provide code to construct the dataframe, I will sketch a potential solution:
First, I will assume that you have only one timestamp per day (which it looks like in your examples). Accordingly, I would truncate or split the timestamp to only have the date in one column. This is done so we can join the dataframes based on the date, i.e. use set_index("date_column")
for both dataframes (use an inner-join to only keep the rows where the date was present in both dataframes). Finally, you can use apply()
to check your condition:
df_joined['Sim Score'] = df_joined.apply(lambda row: (row['Po']-row['Ref'] <= row['Sim']) and (row['Po']+row['Ref'] >= row['Sim']), axis = 1)
Use pandas.DataFrame.reindex
, its method
offer nearest to find the computable index(e.g., string can not calculate distance)
Or use merge_asof
, its direction
offer nearest.
Method 1:
reindex()
with method='nearest'
df1['Timestamp'] = pd.to_datetime(df1['Timestamp'])
df1.set_index('Timestamp', inplace=True)
df1['l_r'] = df1['Po'] - df1['Ref']
df1['h_r'] = df1['Po'] + df1['Ref']
print(df1)
###
Name Time Po Ref l_r h_r
Timestamp
2022-07-11 11:30:00 trial 20 mins 5 2 3 7
2022-07-10 04:00:00 trial 20 mins 4 4 0 8
2022-07-09 02:45:00 trial 20 mins 2 2 0 4
2022-06-28 03:45:00 trial 20 mins 3 6 -3 9
df2['Timestamp'] = pd.to_datetime(df2['Timestamp'])
df2.set_index('Timestamp', inplace=True)
print(df2)
###
Name Time Sim
Timestamp
2022-07-10 05:15:00 trial 20 mins 7
2022-07-11 12:45:00 trial 20 mins 4
2022-07-12 03:30:00 trial 20 mins 8
temp = df2.join(df1.reindex(df2.index, method='nearest'), lsuffix='_left', rsuffix='_right')
print(temp)
As you can see, this is df2.join(df1)
,
join multiple DataFrame objects by index at once.
with method='nearest'
, in this case, it would join df2
and df1
by the nearest Timestamp
index.
df2['Sim Score'] = temp['Sim'].between(temp['l_r'], temp['h_r']).values
df2.reset_index(inplace=True)
print(df2)
###
Timestamp Name Time Sim Sim Score
0 2022-07-10 05:15:00 trial 20 mins 7 True
1 2022-07-11 12:45:00 trial 20 mins 4 True
2 2022-07-12 03:30:00 trial 20 mins 8 False
Method 2:
merge_asof()
with direction='nearest'
this way is not executed with indexed value, therefore we don’t have to set column Timestamp
as index. But it needs binding objects(in this case we merge on column Timestamp
)sorted.
df1['Timestamp'] = pd.to_datetime(df1['Timestamp'])
# df1.set_index('Timestamp', inplace=True)
df1['l_r'] = df1['Po'] - df1['Ref']
df1['h_r'] = df1['Po'] + df1['Ref']
df1.sort_values(by='Timestamp', inplace=True)
print(df1)
###
Timestamp Name Time Po Ref l_r h_r
3 2022-06-28 03:45:00 trial 20 mins 3 6 -3 9
2 2022-07-09 02:45:00 trial 20 mins 2 2 0 4
1 2022-07-10 04:00:00 trial 20 mins 4 4 0 8
0 2022-07-11 11:30:00 trial 20 mins 5 2 3 7
df2['Timestamp'] = pd.to_datetime(df2['Timestamp'])
# df2.set_index('Timestamp', inplace=True)
df2.sort_values(by='Timestamp', inplace=True)
print(df2)
###
Timestamp Name Time Sim
0 2022-07-10 05:15:00 trial 20 mins 7
1 2022-07-11 12:45:00 trial 20 mins 4
2 2022-07-12 03:30:00 trial 20 mins 8
temp = pd.merge_asof(df2 ,df1[['Timestamp', 'l_r', 'h_r']], on='Timestamp', direction='nearest')
print(temp)
As you can see, this is pd.merge_asof(df2, df1)
,
This is similar to a left-join except that we match on nearest key rather than equal keys. Both DataFrames must be sorted by the key.
For each row in the left DataFrame:
A “nearest” search selects the row in the right DataFrame whose ‘on’ key is closest in absolute distance to the left’s key.
df2['Sim Score'] = temp['Sim'].between(temp['l_r'], temp['h_r']).values
print(df2)
###
Timestamp Name Time Sim Sim Score
0 2022-07-10 05:15:00 trial 20 mins 7 True
1 2022-07-11 12:45:00 trial 20 mins 4 True
2 2022-07-12 03:30:00 trial 20 mins 8 False
Frankly speaking, working on indexed things would be faster if you have a large dataset.
Method 2 (on multiple keys)
I remodified df1
adding different Name and Time
df1 = pd.DataFrame({'Timestamp':['7/11/2022 11:30:00','7/11/2022 11:30:00','7/10/2022 04:00:00','7/10/2022 04:00:00','7/09/2022 02:45:00','6/28/2022 03:45:00'],
'Name':['trial','trial','trial','non-trial','trial','trial'],
'Time':['20 mins','30 mins','20 mins','20 mins','20 mins','20 mins'],
'Po':[5, 6, 4, 1, 2, 3],
'Ref':[2, 2, 4, 3, 2, 6]})
df1['Timestamp'] = pd.to_datetime(df1['Timestamp'])
df1['l_r'] = df1['Po'] - df1['Ref']
df1['h_r'] = df1['Po'] + df1['Ref']
df1.sort_values(by='Timestamp', inplace=True)
print(df1)
###
Timestamp Name Time Po Ref l_r h_r
5 2022-06-28 03:45:00 trial 20 mins 3 6 -3 9
4 2022-07-09 02:45:00 trial 20 mins 2 2 0 4
2 2022-07-10 04:00:00 trial 20 mins 4 4 0 8
3 2022-07-10 04:00:00 non-trial 20 mins 1 3 -2 4
0 2022-07-11 11:30:00 trial 20 mins 5 2 3 7
1 2022-07-11 11:30:00 trial 30 mins 6 2 4 8
print(df2)
###
Timestamp Name Time Sim
0 2022-07-10 05:15:00 trial 20 mins 7
1 2022-07-11 12:45:00 trial 20 mins 4
2 2022-07-12 03:30:00 trial 20 mins 8
Important:
can only merge_asof on a single key, therefore others would utilize by=
to deal with.
temp = pd.merge_asof(df2, df1[['Timestamp', 'Name', 'Time', 'l_r', 'h_r']], on='Timestamp', by=['Name','Time'], direction='nearest')
print(temp)
df2['Sim Score'] = temp['Sim'].between(temp['l_r'], temp['h_r']).values
print(df2)
###
Timestamp Name Time Sim Sim Score
0 2022-07-10 05:15:00 trial 20 mins 7 True
1 2022-07-11 12:45:00 trial 20 mins 4 True
2 2022-07-12 03:30:00 trial 20 mins 8 False
Reference:
pandas.DataFrame.join
pandas.merge_asof
merging/join concept
You can do it via pandasql
:
But note that you better add a unique constraint to one of the columns (e.g. a number of trial)
from pandasql import sqldf
df3 = sqldf('''
SELECT df2.Timestamp AS Date, df1.Name, df1.Time, df2.Sim,
CASE
WHEN Sim >= (df1.Po - df1.Ref) AND Sim <= (df1.Po + df1.Ref) THEN 'True'
WHEN Sim < (df1.Po - df1.Ref) OR Sim > (df1.Po + df1.Ref) THEN 'False'
END AS 'Sim Score'
FROM df1, df2
WHERE df2.Name == df1.Name AND df2.Time == df1.Time
ORDER BY Date ASC
''')
Also to work with datetime
format in sqldf
you need to name your Timestamp column as date in the query. If you need to get only let’s say first/earliest 5 results add LIMIT 5
in the end of the query.
If you need to get closest date in df2 to df1 try this:
from pandasql import sqldf
df3 = sqldf('''
SELECT df2.Timestamp AS Date1, df2.Timestamp AS Date2,
df1.Name, df1.Time, df2.Sim,
CASE
WHEN Sim >= (df1.Po - df1.Ref) AND Sim <= (df1.Po + df1.Ref) THEN 'True'
WHEN Sim < (df1.Po - df1.Ref) OR Sim > (df1.Po + df1.Ref) THEN 'False'
END AS 'Sim Score'
FROM df1, df2
WHERE df2.Name == df1.Name AND df2.Time == df1.Time
and Date1 <= Date2
group by Date2
ORDER BY Date1 ASC
''')