Merging DataFrames on timestamp columns with a tolerance
Question:
I have the following Dataframe:
df1
startTimeIso | endTimeIso | id |
---|---|---|
2023-03-07T03:28:56.969000 | 2023-03-07T03:29:25.396000 | 5 |
2023-03-07T03:57:08.734000 | 2023-03-07T03:59:08.734000 | 7 |
2023-03-07T04:18:08.734000 | 2023-03-07T04:20:10.271000 | 16 |
2023-03-07T07:58:08.734000 | 2023-03-07T07:58:10.271000 | 21 |
and the second one:
df2
startTimeIso | endTimeIso | value |
---|---|---|
2023-03-07T03:28:57.169000 | 2023-03-07T03:29:25.996000 | true |
2023-03-07T03:57:08.734000 | 2023-03-07T03:58:08.734000 | true |
2023-03-07T05:38:08.734000 | 2023-03-07T05:40:10.271000 | true |
2023-03-07T07:58:08.934000 | 2023-03-07T07:58:10.371000 | true |
I want to check whether each row of df2 matches a row of df1. A tolerance of 1 second is allowed. Both startTimeIso and endTimeIso should be considered.
The result should look like this:
df_merged
startTimeIso | endTimeIso | value | startTimeIso_y | endTimeIso_y | id |
---|---|---|---|---|---|
2023-03-07T03:28:57.169000 | 2023-03-07T03:29:25.996000 | true | 2023-03-07T03:28:56.969000 | 2023-03-07T03:29:25.396000 | 5 |
2023-03-07T03:57:08.734000 | 2023-03-07T03:58:08.734000 | true | None | None | None |
2023-03-07T05:38:08.734000 | 2023-03-07T05:40:10.271000 | true | None | None | None |
2023-03-07T07:58:08.934000 | 2023-03-07T07:58:10.371000 | true | 2023-03-07T07:58:08.734000 | 2023-03-07T07:58:10.271000 | 21 |
Rows_found = 3
Answers:
one-to-one merge using merge_asof
Use a merge_asof with tolerance:

    import pandas as pd

    # Parse the ISO strings into datetimes so they can be compared
    df1[['startTimeIso', 'endTimeIso']] = df1[['startTimeIso', 'endTimeIso']].apply(pd.to_datetime)
    df2[['startTimeIso', 'endTimeIso']] = df2[['startTimeIso', 'endTimeIso']].apply(pd.to_datetime)

    # Both frames must be sorted on the merge key; each df2 row is matched to
    # the nearest df1 start time, provided it lies within 1 second
    out = pd.merge_asof(
        df2.sort_values(by='startTimeIso'),
        df1.sort_values(by='startTimeIso')
           .rename(columns={'startTimeIso': 'startTimeIso_y'}),
        left_on='startTimeIso', right_on='startTimeIso_y',
        direction='nearest', tolerance=pd.Timedelta('1s'),
        suffixes=(None, '_y')
    )
    print(out)

Output:

                 startTimeIso              endTimeIso  value            endTimeIso_y    id
    0 2023-03-07 03:28:57.169 2023-03-07 03:29:25.996   True 2023-03-07 03:29:25.396   5.0
    1 2023-03-07 03:57:08.734 2023-03-07 03:58:08.734   True 2023-03-07 03:58:08.734   7.0
    2 2023-03-07 05:38:08.734 2023-03-07 05:40:10.271   True                     NaT   NaN
    3 2023-03-07 07:58:08.934 2023-03-07 07:58:10.371   True 2023-03-07 07:58:10.271  21.0

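Note that this merge only looks at the start times, while the question asks for both start and end to agree. One possible way to enforce that, sketched here under my own assumptions, is to run the same merge_asof and then blank out matches whose end times differ by more than the tolerance. The names `tol`, `end_ok`, `right_cols` and `rows_matched` are illustrative and not part of the original answer; the data is copied from the question.

```python
import pandas as pd

# Sample data copied from the question
df1 = pd.DataFrame({
    "startTimeIso": ["2023-03-07T03:28:56.969000", "2023-03-07T03:57:08.734000",
                     "2023-03-07T04:18:08.734000", "2023-03-07T07:58:08.734000"],
    "endTimeIso": ["2023-03-07T03:29:25.396000", "2023-03-07T03:59:08.734000",
                   "2023-03-07T04:20:10.271000", "2023-03-07T07:58:10.271000"],
    "id": [5, 7, 16, 21],
})
df2 = pd.DataFrame({
    "startTimeIso": ["2023-03-07T03:28:57.169000", "2023-03-07T03:57:08.734000",
                     "2023-03-07T05:38:08.734000", "2023-03-07T07:58:08.934000"],
    "endTimeIso": ["2023-03-07T03:29:25.996000", "2023-03-07T03:58:08.734000",
                   "2023-03-07T05:40:10.271000", "2023-03-07T07:58:10.371000"],
    "value": [True, True, True, True],
})
for df in (df1, df2):
    df[["startTimeIso", "endTimeIso"]] = df[["startTimeIso", "endTimeIso"]].apply(pd.to_datetime)

tol = pd.Timedelta("1s")  # hypothetical name for the 1-second tolerance

# Step 1: nearest match on the start time, as in the answer above
merged = pd.merge_asof(
    df2.sort_values("startTimeIso"),
    df1.sort_values("startTimeIso").rename(columns={"startTimeIso": "startTimeIso_y"}),
    left_on="startTimeIso", right_on="startTimeIso_y",
    direction="nearest", tolerance=tol,
    suffixes=(None, "_y"),
)

# Step 2: keep a match only if the end times also agree within the tolerance.
# Rows without any match have NaT here, which compares as False.
end_ok = (merged["endTimeIso"] - merged["endTimeIso_y"]).abs() <= tol

# Blank out the df1 columns for rows that fail the end-time check
right_cols = ["startTimeIso_y", "endTimeIso_y", "id"]
for col in right_cols:
    merged[col] = merged[col].where(end_ok)

rows_matched = int(end_ok.sum())
print(merged)
print("rows matched on both start and end:", rows_matched)
```

With the question's data, the 03:57:08.734 row loses its match with id 7 because the end times are a minute apart. One caveat of this sketch: merge_asof picks the single nearest start time, so a df1 row that is slightly further away in start but closer in end would not be considered.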
If a match on either the start or the end is enough, perform two merges and combine the results with combine_first:

    # Match on the start times; reset_index() keeps df2's original index as a column
    out1 = pd.merge_asof(
        df2.sort_values(by='startTimeIso').reset_index(),
        df1.sort_values(by='startTimeIso')
           .rename(columns={'startTimeIso': 'startTimeIso_y'}),
        left_on='startTimeIso', right_on='startTimeIso_y',
        direction='nearest', tolerance=pd.Timedelta('1s'),
        suffixes=(None, '_y')
    )
    # Match on the end times
    out2 = pd.merge_asof(
        df2.sort_values(by='endTimeIso').reset_index(),
        df1.sort_values(by='endTimeIso')
           .rename(columns={'endTimeIso': 'endTimeIso_y'}),
        left_on='endTimeIso', right_on='endTimeIso_y',
        direction='nearest', tolerance=pd.Timedelta('1s'),
        suffixes=(None, '_y')
    )
    # Align both results on df2's original index before combining, because the
    # two merges can be sorted differently; out2 fills the gaps left by out1
    out = out1.set_index('index').combine_first(out2.set_index('index'))
    print(out)

many-to-many merge using numpy
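
    import numpy as np

    # df1 as column vectors, df2 as row vectors: broadcasting compares all pairs
    s1 = df1['startTimeIso'].to_numpy()[:,None]
    s2 = df2['startTimeIso'].to_numpy()
    e1 = df1['endTimeIso'].to_numpy()[:,None]
    e2 = df2['endTimeIso'].to_numpy()
    # True where the start (ms) / end (me) times differ by less than 1 second
    ms = abs(s1-s2) < pd.Timedelta('1s')
    me = abs(e1-e2) < pd.Timedelta('1s')
    # Positions of the (df1, df2) pairs that match on both start and end
    idx1, idx2 = np.where(ms&me)
    # Label the matched df1 rows with the index of their df2 partner, then left-join
    out = df2.join(df1.iloc[idx1].set_axis(df2.index[idx2])
                      .rename(columns={'startTimeIso': 'startTimeIso_y',
                                       'endTimeIso': 'endTimeIso_y'}))
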
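To make the broadcasting step more concrete, here is a small self-contained sketch that builds the pairwise mask on the question's data and lists which rows pair up. The names `within`, `pairs` and `matches_per_df2_row` are my own and purely illustrative, and `np.timedelta64(1, 's')` is used in place of `pd.Timedelta('1s')` only to keep the comparison in plain NumPy.

```python
import numpy as np
import pandas as pd

# Sample data copied from the question
df1 = pd.DataFrame({
    "startTimeIso": pd.to_datetime(["2023-03-07T03:28:56.969000", "2023-03-07T03:57:08.734000",
                                    "2023-03-07T04:18:08.734000", "2023-03-07T07:58:08.734000"]),
    "endTimeIso": pd.to_datetime(["2023-03-07T03:29:25.396000", "2023-03-07T03:59:08.734000",
                                  "2023-03-07T04:20:10.271000", "2023-03-07T07:58:10.271000"]),
    "id": [5, 7, 16, 21],
})
df2 = pd.DataFrame({
    "startTimeIso": pd.to_datetime(["2023-03-07T03:28:57.169000", "2023-03-07T03:57:08.734000",
                                    "2023-03-07T05:38:08.734000", "2023-03-07T07:58:08.934000"]),
    "endTimeIso": pd.to_datetime(["2023-03-07T03:29:25.996000", "2023-03-07T03:58:08.734000",
                                  "2023-03-07T05:40:10.271000", "2023-03-07T07:58:10.371000"]),
    "value": [True, True, True, True],
})

tol = np.timedelta64(1, "s")

# Shape (len(df1), 1) against shape (len(df2),) broadcasts to a
# (len(df1), len(df2)) grid holding every pairwise time difference
s1 = df1["startTimeIso"].to_numpy()[:, None]
s2 = df2["startTimeIso"].to_numpy()
e1 = df1["endTimeIso"].to_numpy()[:, None]
e2 = df2["endTimeIso"].to_numpy()

# within[i, j] is True when df1 row i and df2 row j agree on start AND end
within = (np.abs(s1 - s2) < tol) & (np.abs(e1 - e2) < tol)

# Matching pairs as (df1 id, df2 index label)
idx1, idx2 = np.nonzero(within)
pairs = list(zip(df1["id"].to_numpy()[idx1].tolist(), df2.index[idx2].tolist()))

# How many df1 rows each df2 row matched; values above 1 would mean the
# join duplicates that df2 row (the many-to-many case)
matches_per_df2_row = within.sum(axis=0)

print(pairs)
print(matches_per_df2_row)
```

The grid has len(df1) × len(df2) cells, so memory grows with the product of the two frame sizes; for large inputs the merge_asof approach is likely the more practical choice.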