merging df in pandas based on "contains" values

Question

I have 2 dfs

df_1

Nº.do Incidente Status  Description Per_Extracao
0   IN6948271   ENCERRADO   GR26 D.I.T.I. >>> ABEND NO JOB PP_SASG_GD9822...    DE : 2022/01/05 ATÉ : 2022/12/08
1   IN6948304   ENCERRADO   GR26 D.I.T.I. >>> ABEND NO JOB PP_AACE_R4539 ...    DE : 2022/01/05 ATÉ : 2022/12/08
2   IN6948307   ENCERRADO   GR26 D.I.T.I. >>> ABEND NO JOB PP_ADAT_SPRK_EX...   DE : 2022/01/05 ATÉ : 2022/12/08
3   IN6948309   ENCERRADO   GR26 D.I.T.I. >>> ABEND NO JOB PP_ADAT_SPRK_EX...   DE : 2022/01/05 ATÉ : 2022/12/08
4   IN6948310   ENCERRADO   GR26 D.I.T.I. >>> ABEND NO JOB PP_ADAT_SPRK_EX...   DE : 2022/01/05 ATÉ : 2022/12/08
5   IN6948311   ENCERRADO   GR26 D.I.T.I. >>> ABEND NO JOB PP_ADAT_SPRK_EX...   DE : 2022/01/05 ATÉ : 2022/12/08

df_2

    JOB_NAME    JOB_STREAM_NAME
0   PP_AACD_NR_D8706_TIHIBRIDA_PROC_EXCUC_D P26_AACD_FAC_TOD
1   PP_SASG_GD9822  P26_AACE_U08
2   PP_AACE_R4539   P26_AACE_U09
3   PP_AACE_R4539_CONS_JUNC P26_AACE_U08
4   PP_AACE_R4539_FMRC_TD_01    P26_AACE_U08
5   PP_AACE_R4539_FMRC_TD_02    P26_AACE_U08

I’m trying to merge then based on the value of JOB_NAME in df_2

the output should be something like this

merged_df

Nº.do Incidente Status  Description Per_Extracao JOB_NAME    JOB_STREAM_NAME
0   IN6948271   ENCERRADO   GR26 D.I.T.I. >>> ABEND NO JOB PP_SASG_GD9822...    DE : 2022/01/05 ATÉ : 2022/12/08 PP_SASG_GD9822  P26_AACE_U08
1   IN6948304   ENCERRADO   GR26 D.I.T.I. >>> ABEND NO JOB PP_AACE_R4539 ...    DE : 2022/01/05 ATÉ : 2022/12/08 PP_AACE_R4539   P26_AACE_U09

its not a regular join, its a contains contains condition("JOB_NAME" value in df_2 founded in "Description" column of df_1).

could you guys help me please?

Asked By: gfernandes

||

Source

Answer 1

Based on this answer we can merge your dfs with some little changes.

pat = "|".join(map(str, df2['JOB_NAME']))
df1.insert(0, 'merge_key', df1['Description'].str.extract("("+pat+")", expand=False))

df = df1.merge(df2, left_on='merge_key', right_on='JOB_NAME').drop('merge_key', axis=1)
print(df)

       Nº.do Incidente Status                                  Description  jobName                      Per_Extracao        JOB_NAME JOB_STREAM_NAME
0  ENCERRADO             GR26  D.I.T.I. >>> ABEND NO JOB PP_SASG_GD9822...      NaN  DE : 2022/01/05 ATÉ : 2022/12/08  PP_SASG_GD9822    P26_AACE_U08
1  ENCERRADO             GR26  D.I.T.I. >>> ABEND NO JOB PP_AACE_R4539 ...      NaN  DE : 2022/01/05 ATÉ : 2022/12/08   PP_AACE_R4539    P26_AACE_U09

Explanation:

The variable pat contains one big string with all jobnames we are searching for, seperated by |.

Then we insert a new column called merge_key at the 1st position (which doesn’t really matter here, it could also be the last column) with str.extraxt, which searchs in the column df1['description'] in every row, if there is a match with any name of all job names in pat and if there is a match, it adds that match to the new column (we seperated them with | because in regex the | stands for or, the () are needed to make it caputered groups (see the documentation) for more details on regular expressions.

With the new column we can merge them two df on merge_key and JOB_NAME
At the end we drop that column merge_key since we only needed it for the merging.

Answered By: Rabinzel

Answer 2

The easiest way is:

– Imagine this is your first csv file ::

and this is second csv file ::

So as you can see these 2 data is like yours on pattern , there are only 2 rows on Sensor_2 column that are equal to the other file . ( check 1st and 2nd row on both tables)

now what we can do is :

import pandas as pd 

df1 = pd.read_excel("myData1.xlsx")
df1.head()

df2 = pd.read_excel("myData2.xlsx")
df2.head()

df_merged = df1.merge(df2,on=["Sensor_2"],suffixes=('', '_drop'))
df_merged.drop([col for col in df_merged.columns if 'drop' in col], axis=1, inplace=True)
print(df_merged.head())

and the result is :

   Sensor_1  Sensor_2  Sensor_3  Year
0        10         1       1.5  2020
1        11         2       2.0  2020

Answered By: two.X

merging df in pandas based on "contains" values

Question:

Answers: