Merging two DataFrames on time columns with a tolerance

Question:

I have the following DataFrame:
df1

startTimeIso endTimeIso id
2023-03-07T03:28:56.969000 2023-03-07T03:29:25.396000 5
2023-03-07T03:57:08.734000 2023-03-07T03:59:08.734000 7
2023-03-07T04:18:08.734000 2023-03-07T04:20:10.271000 16
2023-03-07T07:58:08.734000 2023-03-07T07:58:10.271000 21

and the second one:
df2

startTimeIso endTimeIso value
2023-03-07T03:28:57.169000 2023-03-07T03:29:25.996000 true
2023-03-07T03:57:08.734000 2023-03-07T03:58:08.734000 true
2023-03-07T05:38:08.734000 2023-03-07T05:40:10.271000 true
2023-03-07T07:58:08.934000 2023-03-07T07:58:10.371000 true

I want to check whether each row of df2 matches a row of df1. A tolerance of 1 second is allowed, and both startTimeIso and endTimeIso should be considered.

The result should look like this:
df_merged

startTimeIso endTimeIso value startTimeIso_y endTimeIso_y id
2023-03-07T03:28:57.169000 2023-03-07T03:29:25.996000 true 2023-03-07T03:28:56.969000 2023-03-07T03:29:25.396000 5
2023-03-07T03:57:08.734000 2023-03-07T03:58:08.734000 true None None None
2023-03-07T05:38:08.734000 2023-03-07T05:40:10.271000 true None None None
2023-03-07T07:58:08.934000 2023-03-07T07:58:10.371000 true 2023-03-07T07:58:08.734000 2023-03-07T07:58:10.271000 21

Rows_found = 3

Asked By: hubi3012


Answers:

one-to-one merge using merge_asof

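Use merge_asof with a tolerance. Note that this matches on startTimeIso only:

import pandas as pd

# merge_asof needs real datetimes (not strings) and sorted keys
df1[['startTimeIso', 'endTimeIso']] = df1[['startTimeIso', 'endTimeIso']].apply(pd.to_datetime)
df2[['startTimeIso', 'endTimeIso']] = df2[['startTimeIso', 'endTimeIso']].apply(pd.to_datetime)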

out = pd.merge_asof(
    df2.sort_values(by='startTimeIso'),
    df1.sort_values(by='startTimeIso')
       .rename(columns={'startTimeIso': 'startTimeIso_y'}),
    left_on='startTimeIso', right_on='startTimeIso_y',
    direction='nearest', tolerance=pd.Timedelta('1s'),
    suffixes=(None, '_y')
)

print(out)

Output:

             startTimeIso              endTimeIso  value            endTimeIso_y    id
0 2023-03-07 03:28:57.169 2023-03-07 03:29:25.996   True 2023-03-07 03:29:25.396   5.0
1 2023-03-07 03:57:08.734 2023-03-07 03:58:08.734   True 2023-03-07 03:58:08.734   7.0
2 2023-03-07 05:38:08.734 2023-03-07 05:40:10.271   True                     NaT   NaN
3 2023-03-07 07:58:08.934 2023-03-07 07:58:10.371   True 2023-03-07 07:58:10.271  21.0
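
The question asks for both the start and the end time to be within the tolerance, while the merge above only looks at the start time. One possible way to get there, sketched below, is to run the same merge_asof and then blank out matches whose end times are too far apart. The names tol, merged, end_mismatch and rows_found are made up for this illustration, and the data is the sample from the question as listed (df1's end time for id 7 is 03:59:08.734).

```python
import pandas as pd

# Sample data copied from the question
df1 = pd.DataFrame({
    "startTimeIso": pd.to_datetime([
        "2023-03-07T03:28:56.969000", "2023-03-07T03:57:08.734000",
        "2023-03-07T04:18:08.734000", "2023-03-07T07:58:08.734000"]),
    "endTimeIso": pd.to_datetime([
        "2023-03-07T03:29:25.396000", "2023-03-07T03:59:08.734000",
        "2023-03-07T04:20:10.271000", "2023-03-07T07:58:10.271000"]),
    "id": [5, 7, 16, 21],
})
df2 = pd.DataFrame({
    "startTimeIso": pd.to_datetime([
        "2023-03-07T03:28:57.169000", "2023-03-07T03:57:08.734000",
        "2023-03-07T05:38:08.734000", "2023-03-07T07:58:08.934000"]),
    "endTimeIso": pd.to_datetime([
        "2023-03-07T03:29:25.996000", "2023-03-07T03:58:08.734000",
        "2023-03-07T05:40:10.271000", "2023-03-07T07:58:10.371000"]),
    "value": [True, True, True, True],
})

tol = pd.Timedelta("1s")  # hypothetical name for the 1-second tolerance

# Step 1: nearest match on the start time, as in the answer
merged = pd.merge_asof(
    df2.sort_values("startTimeIso"),
    df1.sort_values("startTimeIso")
       .rename(columns={"startTimeIso": "startTimeIso_y",
                        "endTimeIso": "endTimeIso_y"}),
    left_on="startTimeIso", right_on="startTimeIso_y",
    direction="nearest", tolerance=tol,
)

# Step 2: a match only counts if the end times also agree within the tolerance.
# Comparisons against NaT are False, so unmatched rows are left untouched.
end_mismatch = (merged["endTimeIso"] - merged["endTimeIso_y"]).abs() > tol
for col in ("startTimeIso_y", "endTimeIso_y", "id"):
    merged[col] = merged[col].mask(end_mismatch)

rows_found = int(merged["id"].notna().sum())
print(merged)
print("rows_found =", rows_found)
```

With this data, the row whose start time matches id 7 is rejected because the end times differ by a minute, so only ids 5 and 21 remain. One limitation of this sketch: merge_asof picks a single nearest candidate per row, so a second df1 row that would satisfy both conditions is never tried.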

To accept a match on either the start or the end time, perform two merges and combine them with combine_first:

out1 = pd.merge_asof(
    df2.sort_values(by='startTimeIso').reset_index(),
    df1.sort_values(by='startTimeIso')
       .rename(columns={'startTimeIso': 'startTimeIso_y'}),
    left_on='startTimeIso', right_on='startTimeIso_y',
    direction='nearest', tolerance=pd.Timedelta('1s'),
    suffixes=(None, '_y')
)

out2 = pd.merge_asof(
    df2.sort_values(by='endTimeIso').reset_index(),
    df1.sort_values(by='endTimeIso')
       .rename(columns={'endTimeIso': 'endTimeIso_y'}),
    left_on='endTimeIso', right_on='endTimeIso_y',
    direction='nearest', tolerance=pd.Timedelta('1s'),
    suffixes=(None, '_y')
)

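# Align on the original df2 row label before combining: out1 and out2 are
# sorted by different keys, so their positional order may differ
out = out1.set_index('index').combine_first(out2.set_index('index'))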

print(out)
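
As a small illustration of what combine_first does here: it aligns rows by index label, not by position, keeps the values of the calling frame, and only fills its missing values from the other frame. The toy frames by_start and by_end below are hypothetical stand-ins for the two merge results, so it matters which index the frames carry when they are combined.

```python
import pandas as pd

# Hypothetical stand-ins for the two merge results, labelled by df2 row
by_start = pd.DataFrame({"id": [5.0, None, None]}, index=[0, 1, 2])
# Same labels but in a different row order, as after sorting by another key
by_end = pd.DataFrame({"id": [7.0, 99.0, None]}, index=[1, 0, 2])

combined = by_start.combine_first(by_end)
print(combined)
# Label 0 keeps 5.0 from by_start (99.0 is ignored),
# label 1 is filled with 7.0 from by_end, label 2 stays NaN.
```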

many-to-many merge using NumPy broadcasting

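import numpy as np

# Assumes the time columns were already converted with pd.to_datetime (see above)
# Reshape df1's times to column vectors so they broadcast against df2's times
s1 = df1['startTimeIso'].to_numpy()[:, None]
s2 = df2['startTimeIso'].to_numpy()
e1 = df1['endTimeIso'].to_numpy()[:, None]
e2 = df2['endTimeIso'].to_numpy()

# Pairwise masks, shape (len(df1), len(df2)): True where the times are within 1 s
ms = abs(s1 - s2) < pd.Timedelta('1s')
me = abs(e1 - e2) < pd.Timedelta('1s')

# Positions of the (df1, df2) pairs where both start and end match
idx1, idx2 = np.where(ms & me)

# Relabel the matching df1 rows with the df2 index, then left-join onto df2
out = df2.join(df1.iloc[idx1].set_axis(df2.index[idx2])
                  .rename(columns={'startTimeIso': 'startTimeIso_y',
                                   'endTimeIso': 'endTimeIso_y'}))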
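
For reference, here is a self-contained sketch of the same broadcasting idea run on the question's sample data, with a count of matched df2 rows added. The names tol, match, rows_found and pairs are illustrative, and the comparison uses <= so that a difference of exactly 1 second still counts, which is an assumption about what the tolerance should mean.

```python
import numpy as np
import pandas as pd

# Sample data copied from the question
df1 = pd.DataFrame({
    "startTimeIso": pd.to_datetime([
        "2023-03-07T03:28:56.969000", "2023-03-07T03:57:08.734000",
        "2023-03-07T04:18:08.734000", "2023-03-07T07:58:08.734000"]),
    "endTimeIso": pd.to_datetime([
        "2023-03-07T03:29:25.396000", "2023-03-07T03:59:08.734000",
        "2023-03-07T04:20:10.271000", "2023-03-07T07:58:10.271000"]),
    "id": [5, 7, 16, 21],
})
df2 = pd.DataFrame({
    "startTimeIso": pd.to_datetime([
        "2023-03-07T03:28:57.169000", "2023-03-07T03:57:08.734000",
        "2023-03-07T05:38:08.734000", "2023-03-07T07:58:08.934000"]),
    "endTimeIso": pd.to_datetime([
        "2023-03-07T03:29:25.996000", "2023-03-07T03:58:08.734000",
        "2023-03-07T05:40:10.271000", "2023-03-07T07:58:10.371000"]),
    "value": [True, True, True, True],
})

tol = np.timedelta64(1, "s")  # assumed inclusive tolerance

# Column vectors for df1 broadcast against row vectors for df2
s1 = df1["startTimeIso"].to_numpy()[:, None]
s2 = df2["startTimeIso"].to_numpy()
e1 = df1["endTimeIso"].to_numpy()[:, None]
e2 = df2["endTimeIso"].to_numpy()

# match[i, j] is True when df1 row i and df2 row j agree on start AND end
match = (np.abs(s1 - s2) <= tol) & (np.abs(e1 - e2) <= tol)
idx1, idx2 = np.nonzero(match)

# Number of distinct df2 rows that found at least one partner in df1
rows_found = len(np.unique(idx2))
# (df1 id, df2 row label) for every matching pair
pairs = list(zip(df1["id"].to_numpy()[idx1].tolist(),
                 df2.index[idx2].tolist()))
print(rows_found, pairs)
```

The boolean matrix has len(df1) × len(df2) entries, so this approach is best suited to frames of moderate size.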
Answered By: mozway