Pandas' merge returns a column with _x appended to the name
Question:
I have two dataframes: `df1` has columns A, B, C, D… and `df2` has columns A, B, E, F…
The keys I want to merge on are in column `A`. `B` is also (most likely) the same in both dataframes. This is a big dataset, so I do not have a good overview of everything yet.
I do:
pd.merge(df1, df2, on='A')
The result contains a column called `B_x`. Since the dataset is big and messy, I haven't tried to investigate how `B_x` differs from `B` in `df1` and `B` in `df2`.
So my question is just in general: What does Pandas mean when it has appended `_x` to a column name in the merged dataframe?
Answers:
The suffixes are added to any column names that clash between the two dataframes and are not used as merge keys: `_x` marks the column from the left dataframe and `_y` the one from the right. See the online docs.
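A minimal sketch of that behaviour, using two small made-up dataframes (the values are assumptions for illustration, not taken from the question):

```python
import pandas as pd

# Hypothetical toy data: both frames share the key column A and a non-key column B.
df1 = pd.DataFrame({"A": [1, 2, 3], "B": ["x", "y", "z"], "C": [10, 20, 30]})
df2 = pd.DataFrame({"A": [1, 2, 3], "B": ["x", "y", "q"], "E": [0.1, 0.2, 0.3]})

# B is not a merge key, so pandas keeps both copies and suffixes them.
merged = pd.merge(df1, df2, on="A")
print(list(merged.columns))  # ['A', 'B_x', 'C', 'B_y', 'E']
```

Here `B_x` holds the values of `B` from `df1` and `B_y` holds the values from `df2`.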
So in your case, if you think they are the same, you could just merge on both columns:
pd.merge(df1, df2, on=['A', 'B'])
What this will do, though, is return only the rows where the values of `A` and `B` exist in both dataframes, as the default merge type is an inner merge.
So what you could do is compare the size of this merged dataframe with your first one and see if they are the same; if so, you could merge on both columns or just drop/rename the `_x`/`_y` suffixed `B` columns.
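One way to run that check, sketched with hypothetical toy data. It assumes "your first one" means the merge on `A` alone, and it adds a row-by-row comparison of the two suffixed columns as a second view of the same question:

```python
import pandas as pd

# Hypothetical toy data standing in for the real df1 and df2.
df1 = pd.DataFrame({"A": [1, 2, 3], "B": ["x", "y", "z"], "C": [10, 20, 30]})
df2 = pd.DataFrame({"A": [1, 2, 3], "B": ["x", "y", "z"], "E": [0.1, 0.2, 0.3]})

on_a = pd.merge(df1, df2, on="A")
on_a_and_b = pd.merge(df1, df2, on=["A", "B"])

# If merging on both columns loses no rows, B agrees wherever A matches.
same_size = len(on_a_and_b) == len(on_a)

# Row-level view of the same question: where do the two copies of B differ?
mismatches = on_a[on_a["B_x"] != on_a["B_y"]]

if same_size and mismatches.empty:
    # Keep one copy of B and restore its original name.
    cleaned = on_a.drop(columns="B_y").rename(columns={"B_x": "B"})
```

Note that `NaN != NaN` in pandas, so rows where `B` is missing in both dataframes show up in `mismatches` with this simple comparison.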
I would spend time, though, determining whether these values are indeed the same and exist in both dataframes. You may also wish to perform an outer merge, which keeps rows whose keys appear in only one of the dataframes:
pd.merge(df1, df2, on=['A', 'B'], how='outer')
Then you could drop duplicate rows (and possibly any `NaN` rows), which should give you a clean merged dataframe.
merged_df.drop_duplicates(subset=['A', 'B'], inplace=True)
See online docs for drop_duplicates
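Put together, the outer merge and clean-up could look roughly like this. The toy data is hypothetical, and whether dropping `NaN` rows is appropriate depends on your dataset:

```python
import pandas as pd

# Hypothetical toy data: (2, 'y') is duplicated in df1, key 1 exists only in df1,
# and key 3 exists only in df2.
df1 = pd.DataFrame({"A": [1, 2, 2], "B": ["x", "y", "y"], "C": [10, 20, 20]})
df2 = pd.DataFrame({"A": [2, 3], "B": ["y", "z"], "E": [0.2, 0.3]})

# The outer merge keeps unmatched rows from both sides and fills the gaps with NaN.
merged_df = pd.merge(df1, df2, on=["A", "B"], how="outer")

# Keep one row per (A, B) pair.
merged_df = merged_df.drop_duplicates(subset=["A", "B"])

# Optionally keep only rows that have values from both dataframes.
complete = merged_df.dropna()
```

In this sketch `merged_df` ends up with one row per key pair, and `complete` contains only the pair that was present in both inputs.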
A merged dataframe shouldn't have overlapping column names, so, as EdChum mentioned, if the merged dataframe has `B_x` when it should have `B`, it means both dataframes had a column `B` and pandas made the executive decision to add the suffix `_x` to the `B` column of the left dataframe and `_y` to the `B` column of the right dataframe.
In fact, you can change these suffixes by passing a tuple to the `suffixes=` parameter of `merge()`. For example,
merged_df = df1.merge(df2, on='A', suffixes=('_left', '_right'))
Now, `merged_df` will have `B_left` instead of `B_x`. If you pass empty strings:
df1.merge(df2, on='A', suffixes=('', ''))
you’ll get a ValueError similar to the following
ValueError: columns overlap but no suffix specified: Index(['B'], dtype='object')
which says that overlapping columns were identified.
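As a small sketch of both cases (the dataframes are hypothetical toy data): leaving only one of the two suffixes empty is accepted, so the left column can keep its original name, while two empty suffixes raise the error shown above.

```python
import pandas as pd

# Hypothetical toy data with an overlapping non-key column B.
df1 = pd.DataFrame({"A": [1, 2], "B": ["x", "y"]})
df2 = pd.DataFrame({"A": [1, 2], "B": ["x", "q"]})

# Only the right-hand copy gets a suffix; the left copy keeps the name B.
merged_df = df1.merge(df2, on="A", suffixes=("", "_right"))
print(list(merged_df.columns))  # ['A', 'B', 'B_right']

# With no suffix on either side, pandas cannot tell the columns apart.
try:
    df1.merge(df2, on="A", suffixes=("", ""))
except ValueError as err:
    print(err)
```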