Query for one dataframe row based on row in another dataframe & compare values

Question:

So I have two data frames. The first data frame contains numerical data that is used to "score" the second data frame which contains simulation data.

df1 = base records

df2 = simulation records

Part 1: For a given row in df2 (simulation records), I want to query df1 (base records) to find the row with the most recent timestamp relative to that df2 row, where the "Name" & "Time" columns match exactly.

Part 2: Then I want to use an if/then function to determine whether a value in the simulation record row falls within a range created from two values in the base record row, and return a boolean.

low range = df1['Po'] - df1['Ref']

high range = df1['Po'] + df1['Ref']

If df2['Sim'] falls between the low range & high range of its most recent df1 base record, then I want to return True in the new column "Sim Score"; otherwise return False.

Part 3: I want to repeat Part 1 & Part 2 for each row in the simulation records.

helpful information:

  • df1 (base records) may have more or fewer rows than df2 (simulation records)
  • df1 has more columns than df2
  • some columns in df1 have the same name but different values in df2
  • ideally want to be able to slice both dataframes where the if then function only sees the two rows used in the comparison
  • only need the most recent df1 base record to compare to the df2 simulation record
  • previously accomplished this in google sheets with if then & query combination formula dragged down entire sheet (want to replace with python & pandas)
df1 base records example (columns that matter)

Timestamp            Name     Time     Po  Ref

7/11/2022 11:30:00   trial   20 mins   5   2

7/10/2022 04:00:00   trial   20 mins   4   4

7/09/2022 02:45:00   trial   20 mins   2   2

6/28/2022 03:45:00   trial   20 mins   3   6
df2 simulation records example (columns that matter)

Timestamp             Name     Time     Sim

7/10/2022 05:15:00    trial   20 mins   7

7/11/2022 12:45:00    trial   20 mins   4

7/12/2022 03:30:00    trial   20 mins   8
desired result of new column added to df2

Timestamp             Name     Time     Sim  Sim Score

7/10/2022 05:15:00    trial   20 mins   7    True

7/11/2022 12:45:00    trial   20 mins   4    True

7/12/2022 03:30:00    trial   20 mins   8    False
Asked By: Chloe

||

Answers:

Because you don’t provide code to construct the dataframe, I will sketch a potential solution:

First, I will assume that you have only one timestamp per day (which it looks like in your examples). Accordingly, I would truncate or split the timestamp to only have the date in one column. This is done so we can join the dataframes based on the date, i.e. use set_index("date_column") for both dataframes (use an inner-join to only keep the rows where the date was present in both dataframes). Finally, you can use apply() to check your condition:

df_joined['Sim Score'] = df_joined.apply(lambda row: (row['Po']-row['Ref'] <= row['Sim']) and (row['Po']+row['Ref'] >= row['Sim']), axis = 1)
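The sketch above could look roughly like the following. It is only an illustration under stated assumptions: the frames are rebuilt from the question's sample tables, the helper column name "date" is hypothetical, and merge() on the date plus "Name" and "Time" is used here in place of set_index() and join(). Note that with an inner join, a simulation record that has no base record on the same calendar day is dropped rather than scored.

```python
import pandas as pd

# Sample data copied from the question's tables.
df1 = pd.DataFrame({
    "Timestamp": ["7/11/2022 11:30:00", "7/10/2022 04:00:00",
                  "7/09/2022 02:45:00", "6/28/2022 03:45:00"],
    "Name": ["trial"] * 4,
    "Time": ["20 mins"] * 4,
    "Po": [5, 4, 2, 3],
    "Ref": [2, 4, 2, 6],
})
df2 = pd.DataFrame({
    "Timestamp": ["7/10/2022 05:15:00", "7/11/2022 12:45:00",
                  "7/12/2022 03:30:00"],
    "Name": ["trial"] * 3,
    "Time": ["20 mins"] * 3,
    "Sim": [7, 4, 8],
})

for df in (df1, df2):
    df["Timestamp"] = pd.to_datetime(df["Timestamp"], format="%m/%d/%Y %H:%M:%S")
    # "date" is a hypothetical helper column: the timestamp truncated to midnight.
    df["date"] = df["Timestamp"].dt.normalize()

# Inner join: keep only simulation rows that have a base record on the same day.
df_joined = df2.merge(
    df1[["date", "Name", "Time", "Po", "Ref"]],
    on=["date", "Name", "Time"],
    how="inner",
)

# Row-wise range check, as in the one-liner above.
df_joined["Sim Score"] = df_joined.apply(
    lambda row: (row["Po"] - row["Ref"] <= row["Sim"])
    and (row["Po"] + row["Ref"] >= row["Sim"]),
    axis=1,
)
print(df_joined[["Timestamp", "Name", "Time", "Sim", "Sim Score"]])
```

With the sample data, the 7/12/2022 simulation record has no same-day base record, so only two rows remain after the join.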
Answered By: freyberg

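Use pandas.DataFrame.reindex: its method parameter accepts 'nearest', which maps each requested label to the nearest existing index label. This only works on an index for which a distance can be computed (e.g. datetimes or numbers; a distance between strings cannot be calculated).

Or use merge_asof, whose direction parameter accepts 'nearest'.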


Method 1:

reindex() with method='nearest'

df1['Timestamp'] = pd.to_datetime(df1['Timestamp'])
df1.set_index('Timestamp', inplace=True)
df1['l_r'] = df1['Po'] - df1['Ref']
df1['h_r'] = df1['Po'] + df1['Ref']
print(df1)
###
                      Name     Time  Po  Ref  l_r  h_r
Timestamp                                             
2022-07-11 11:30:00  trial  20 mins   5    2    3    7
2022-07-10 04:00:00  trial  20 mins   4    4    0    8
2022-07-09 02:45:00  trial  20 mins   2    2    0    4
2022-06-28 03:45:00  trial  20 mins   3    6   -3    9
df2['Timestamp'] = pd.to_datetime(df2['Timestamp'])
df2.set_index('Timestamp', inplace=True)
print(df2)
###
                      Name     Time  Sim
Timestamp                               
2022-07-10 05:15:00  trial  20 mins    7
2022-07-11 12:45:00  trial  20 mins    4
2022-07-12 03:30:00  trial  20 mins    8
temp = df2.join(df1.reindex(df2.index, method='nearest'), lsuffix='_left', rsuffix='_right')
print(temp)

This is essentially df2.join(df1). From the pandas documentation for join:

join multiple DataFrame objects by index at once.

Because df1 is first reindexed with method='nearest', df2 and df1 are joined on the nearest Timestamp index.

df2['Sim Score'] = temp['Sim'].between(temp['l_r'], temp['h_r']).values
df2.reset_index(inplace=True)
print(df2)
###
            Timestamp   Name     Time  Sim  Sim Score
0 2022-07-10 05:15:00  trial  20 mins    7       True
1 2022-07-11 12:45:00  trial  20 mins    4       True
2 2022-07-12 03:30:00  trial  20 mins    8      False

Method 2:

merge_asof() with direction='nearest'
This approach does not operate on the index, so we don't have to set the Timestamp column as the index. It does require both DataFrames to be sorted by the merge key (in this case, the Timestamp column).

df1['Timestamp'] = pd.to_datetime(df1['Timestamp'])
# df1.set_index('Timestamp', inplace=True)
df1['l_r'] = df1['Po'] - df1['Ref']
df1['h_r'] = df1['Po'] + df1['Ref']
df1.sort_values(by='Timestamp', inplace=True)
print(df1)
###
            Timestamp   Name     Time  Po  Ref  l_r  h_r
3 2022-06-28 03:45:00  trial  20 mins   3    6   -3    9
2 2022-07-09 02:45:00  trial  20 mins   2    2    0    4
1 2022-07-10 04:00:00  trial  20 mins   4    4    0    8
0 2022-07-11 11:30:00  trial  20 mins   5    2    3    7
df2['Timestamp'] = pd.to_datetime(df2['Timestamp'])
# df2.set_index('Timestamp', inplace=True)
df2.sort_values(by='Timestamp', inplace=True)
print(df2)
###
            Timestamp   Name     Time  Sim
0 2022-07-10 05:15:00  trial  20 mins    7
1 2022-07-11 12:45:00  trial  20 mins    4
2 2022-07-12 03:30:00  trial  20 mins    8

temp = pd.merge_asof(df2 ,df1[['Timestamp', 'l_r', 'h_r']], on='Timestamp', direction='nearest')
print(temp)

As you can see, this is pd.merge_asof(df2, df1),

This is similar to a left-join except that we match on nearest key rather than equal keys. Both DataFrames must be sorted by the key.

For each row in the left DataFrame:

A “nearest” search selects the row in the right DataFrame whose ‘on’ key is closest in absolute distance to the left’s key.

df2['Sim Score'] = temp['Sim'].between(temp['l_r'], temp['h_r']).values
print(df2)
###
            Timestamp   Name     Time  Sim  Sim Score
0 2022-07-10 05:15:00  trial  20 mins    7       True
1 2022-07-11 12:45:00  trial  20 mins    4       True
2 2022-07-12 03:30:00  trial  20 mins    8      False

Working on indexed DataFrames is likely to be faster if you have a large dataset.
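For what it's worth, merge_asof can also match on the index via its left_index and right_index parameters, so the two methods can be combined. The following is a minimal sketch, assuming the sample data from the question and that both indexes are sorted; whether it is actually faster on a large dataset would need to be measured.

```python
import pandas as pd

# Sample data copied from the question's tables.
df1 = pd.DataFrame({
    "Timestamp": ["7/11/2022 11:30:00", "7/10/2022 04:00:00",
                  "7/09/2022 02:45:00", "6/28/2022 03:45:00"],
    "Name": ["trial"] * 4,
    "Time": ["20 mins"] * 4,
    "Po": [5, 4, 2, 3],
    "Ref": [2, 4, 2, 6],
})
df2 = pd.DataFrame({
    "Timestamp": ["7/10/2022 05:15:00", "7/11/2022 12:45:00",
                  "7/12/2022 03:30:00"],
    "Name": ["trial"] * 3,
    "Time": ["20 mins"] * 3,
    "Sim": [7, 4, 8],
})

fmt = "%m/%d/%Y %H:%M:%S"
df1["Timestamp"] = pd.to_datetime(df1["Timestamp"], format=fmt)
df2["Timestamp"] = pd.to_datetime(df2["Timestamp"], format=fmt)

# merge_asof needs both keys sorted, so sort the indexes after setting them.
right = df1.set_index("Timestamp").sort_index()
left = df2.set_index("Timestamp").sort_index()
right["l_r"] = right["Po"] - right["Ref"]
right["h_r"] = right["Po"] + right["Ref"]

# Match each simulation row to the base row with the nearest index value.
temp = pd.merge_asof(
    left, right[["l_r", "h_r"]],
    left_index=True, right_index=True,
    direction="nearest",
)
temp["Sim Score"] = temp["Sim"].between(temp["l_r"], temp["h_r"])
print(temp[["Name", "Time", "Sim", "Sim Score"]])
```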

Method 2 (on multiple keys)

I modified df1 again, adding rows with a different Name and Time.

df1 = pd.DataFrame({'Timestamp':['7/11/2022 11:30:00','7/11/2022 11:30:00','7/10/2022 04:00:00','7/10/2022 04:00:00','7/09/2022 02:45:00','6/28/2022 03:45:00'],
                    'Name':['trial','trial','trial','non-trial','trial','trial'],
                    'Time':['20 mins','30 mins','20 mins','20 mins','20 mins','20 mins'],
                    'Po':[5, 6, 4, 1, 2, 3],
                    'Ref':[2, 2, 4, 3, 2, 6]})
df1['Timestamp'] = pd.to_datetime(df1['Timestamp'])
df1['l_r'] = df1['Po'] - df1['Ref']
df1['h_r'] = df1['Po'] + df1['Ref']
df1.sort_values(by='Timestamp', inplace=True)
print(df1)
###
            Timestamp       Name     Time  Po  Ref  l_r  h_r
5 2022-06-28 03:45:00      trial  20 mins   3    6   -3    9
4 2022-07-09 02:45:00      trial  20 mins   2    2    0    4
2 2022-07-10 04:00:00      trial  20 mins   4    4    0    8
3 2022-07-10 04:00:00  non-trial  20 mins   1    3   -2    4
0 2022-07-11 11:30:00      trial  20 mins   5    2    3    7
1 2022-07-11 11:30:00      trial  30 mins   6    2    4    8
print(df2)
###
            Timestamp   Name     Time  Sim
0 2022-07-10 05:15:00  trial  20 mins    7
1 2022-07-11 12:45:00  trial  20 mins    4
2 2022-07-12 03:30:00  trial  20 mins    8

Important:

merge_asof can only match on a single key (on=); any additional columns that must match exactly are passed through by=.

temp = pd.merge_asof(df2, df1[['Timestamp', 'Name', 'Time', 'l_r', 'h_r']], on='Timestamp', by=['Name','Time'], direction='nearest')
print(temp)

df2['Sim Score'] = temp['Sim'].between(temp['l_r'], temp['h_r']).values
print(df2)
###
            Timestamp   Name     Time  Sim  Sim Score
0 2022-07-10 05:15:00  trial  20 mins    7       True
1 2022-07-11 12:45:00  trial  20 mins    4       True
2 2022-07-12 03:30:00  trial  20 mins    8      False
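One caveat: direction='nearest' can pick a base record that is later than the simulation record, whereas the question asks for the most recent one. If only earlier-or-equal base records should be considered, direction='backward' (the default for merge_asof) may be the better fit. A minimal sketch, assuming the modified df1 and the df2 shown above:

```python
import pandas as pd

# Data copied from the multi-key example above.
df1 = pd.DataFrame({
    "Timestamp": ["7/11/2022 11:30:00", "7/11/2022 11:30:00", "7/10/2022 04:00:00",
                  "7/10/2022 04:00:00", "7/09/2022 02:45:00", "6/28/2022 03:45:00"],
    "Name": ["trial", "trial", "trial", "non-trial", "trial", "trial"],
    "Time": ["20 mins", "30 mins", "20 mins", "20 mins", "20 mins", "20 mins"],
    "Po": [5, 6, 4, 1, 2, 3],
    "Ref": [2, 2, 4, 3, 2, 6],
})
df2 = pd.DataFrame({
    "Timestamp": ["7/10/2022 05:15:00", "7/11/2022 12:45:00",
                  "7/12/2022 03:30:00"],
    "Name": ["trial"] * 3,
    "Time": ["20 mins"] * 3,
    "Sim": [7, 4, 8],
})

fmt = "%m/%d/%Y %H:%M:%S"
df1["Timestamp"] = pd.to_datetime(df1["Timestamp"], format=fmt)
df2["Timestamp"] = pd.to_datetime(df2["Timestamp"], format=fmt)
df1["l_r"] = df1["Po"] - df1["Ref"]
df1["h_r"] = df1["Po"] + df1["Ref"]

# Both frames must be sorted by the "on" key.
df1 = df1.sort_values("Timestamp")
df2 = df2.sort_values("Timestamp")

# "backward": for each df2 row, take the last df1 row with the same
# Name and Time whose Timestamp is at or before the df2 Timestamp.
temp = pd.merge_asof(
    df2, df1[["Timestamp", "Name", "Time", "l_r", "h_r"]],
    on="Timestamp", by=["Name", "Time"],
    direction="backward",
)
df2["Sim Score"] = temp["Sim"].between(temp["l_r"], temp["h_r"]).to_numpy()
print(df2)
```

For this sample the result is the same as with direction='nearest', because every simulation record's nearest base record also happens to precede it.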

Reference:
pandas.DataFrame.join
pandas.merge_asof
merging/join concept

Answered By: Baron Legendre

You can do it via pandasql.
Note, however, that it is better to add a unique constraint on one of the columns (e.g. a trial number).

from pandasql import sqldf

df3 = sqldf('''
    SELECT df2.Timestamp AS Date, df1.Name, df1.Time, df2.Sim,
    
    CASE
        WHEN Sim >= (df1.Po - df1.Ref) AND Sim <= (df1.Po + df1.Ref) THEN 'True'
        WHEN Sim < (df1.Po - df1.Ref) OR Sim > (df1.Po + df1.Ref) THEN 'False'
    END AS 'Sim Score'
    
    FROM df1, df2
    
    WHERE df2.Name == df1.Name AND df2.Time == df1.Time
    ORDER BY Date ASC
''')

Also, to work with the datetime format in sqldf, you need to alias your Timestamp column as Date in the query. If you only need, say, the first (earliest) 5 results, add LIMIT 5 at the end of the query.

If you need, for each df2 record, the closest df1 date that is not later than it, try this:

from pandasql import sqldf

df3 = sqldf('''
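    -- Date1 is the base-record timestamp, Date2 the simulation timestamp.
    -- Assumes both Timestamp columns were converted with pd.to_datetime,
    -- so that they compare in chronological order inside SQLite.
    -- With MAX(), SQLite takes the other df1 columns from the row that
    -- holds the maximum, i.e. the most recent base record.
    SELECT MAX(df1.Timestamp) AS Date1, df2.Timestamp AS Date2,
           df1.Name, df1.Time, df2.Sim,
    
    CASE
        WHEN Sim >= (df1.Po - df1.Ref) AND Sim <= (df1.Po + df1.Ref) THEN 'True'
        WHEN Sim < (df1.Po - df1.Ref) OR Sim > (df1.Po + df1.Ref) THEN 'False'
    END AS 'Sim Score'
    
    FROM df1, df2
    
    WHERE df2.Name == df1.Name AND df2.Time == df1.Time
    AND df1.Timestamp <= df2.Timestamp
    GROUP BY df2.Timestamp, df2.Name, df2.Time
    ORDER BY Date2 ASC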
''')
Answered By: George