Can I keep the join column of the right data frame out of the output of a pandas merge?
Question:
I am merging two data frames with pandas. When joining, I would like the output not to include the join column of the right table.
Example:
import pandas as pd
age = [['tom', 10], ['nick', 15], ['juli', 14]]
df1 = pd.DataFrame(age, columns = ['Name', 'Age'])
toy = [['tom', 'GIJoe'], ['nick', 'car']]
df2 = pd.DataFrame(toy, columns = ['Name_child', 'Toy'])
df = pd.merge(df1,df2,left_on='Name',right_on='Name_child',how='left')
df.columns
will give the output Index(['Name', 'Age', 'Name_child', 'Toy'], dtype='object'). Is there an easy way to obtain Index(['Name', 'Age', 'Toy'], dtype='object') instead? I can of course drop the column afterwards with del df['Name_child'], but I'd like my code to be as short as possible.
Answers:
Set the index of the second dataframe to "Name_child". If you do this inside the merge statement, the columns of df2 itself remain unchanged.
df = pd.merge(df1,df2.set_index('Name_child'),left_on='Name',right_index=True,how='left')
This outputs the correct columns:
df
Name Age Toy
0 tom 10 GIJoe
1 nick 15 car
2 juli 14 NaN
df.columns
Index(['Name', 'Age', 'Toy'], dtype='object')
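For what it's worth, the same index-based idea can also be written with DataFrame.join, which matches a column of the calling frame against the index of the other frame. This is only a sketch of an alternative spelling, not something the answer above proposed; the variable name joined is made up for the example.

```python
import pandas as pd

df1 = pd.DataFrame([['tom', 10], ['nick', 15], ['juli', 14]],
                   columns=['Name', 'Age'])
df2 = pd.DataFrame([['tom', 'GIJoe'], ['nick', 'car']],
                   columns=['Name_child', 'Toy'])

# join() looks up df1['Name'] in the index of the other frame;
# how='left' is the default for join(), so all rows of df1 are kept.
joined = df1.join(df2.set_index('Name_child'), on='Name')

print(list(joined.columns))
```

Because 'Name_child' becomes the index of the right-hand frame, it never shows up as a column in the result, and df2 itself is left untouched.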
Based on @mgc's comments, you don't have to rename the columns of df2 itself. Just pass df2 with renamed columns to the merge function; the column names of df2 remain as they are.
df = pd.merge(df1,df2.rename(columns={'Name_child': 'Name'}),on='Name', how='left')
df
Name Age Toy
0 tom 10 GIJoe
1 nick 15 car
2 juli 14 NaN
df.columns
Index(['Name', 'Age', 'Toy'], dtype='object')
df2.columns
Index(['Name_child', 'Toy'], dtype='object')
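The reason df2 keeps its column names is that rename returns a new DataFrame by default instead of modifying the one it is called on. A minimal sketch to check this, where renamed is just an illustrative variable name:

```python
import pandas as pd

df2 = pd.DataFrame([['tom', 'GIJoe'], ['nick', 'car']],
                   columns=['Name_child', 'Toy'])

# rename() without inplace=True builds a new frame; df2 is not modified.
renamed = df2.rename(columns={'Name_child': 'Name'})

print(renamed is df2)         # False
print(list(renamed.columns))  # ['Name', 'Toy']
print(list(df2.columns))      # ['Name_child', 'Toy']
```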
It seems even simpler to drop the column right after the merge.
df = (pd.merge(df1,df2,left_on='Name',right_on='Name_child',how='left')
        .drop('Name_child', axis=1))
#----------------
import this
The Zen of Python, by Tim Peters
Beautiful is better than ugly.
Explicit is better than implicit.
Simple is better than complex.
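If the merge-then-drop pattern comes up repeatedly, it could be wrapped in a small function. The helper below is hypothetical (merge_drop_right_key is not a pandas function, and its name and parameters are made up for this sketch); it assumes a single key column on each side.

```python
import pandas as pd


def merge_drop_right_key(left, right, left_on, right_on, how='left'):
    """Hypothetical helper: merge two frames, then drop the right key column."""
    merged = pd.merge(left, right, left_on=left_on, right_on=right_on, how=how)
    # Only drop when the key names differ; with identical names there is
    # no separate right-hand key column to remove.
    if left_on != right_on:
        merged = merged.drop(columns=right_on)
    return merged


df1 = pd.DataFrame([['tom', 10], ['nick', 15], ['juli', 14]],
                   columns=['Name', 'Age'])
df2 = pd.DataFrame([['tom', 'GIJoe'], ['nick', 'car']],
                   columns=['Name_child', 'Toy'])

df = merge_drop_right_key(df1, df2, 'Name', 'Name_child')
print(list(df.columns))
```

One caveat: with how='right' or how='outer', the right key column can hold values that the left key lacks, so dropping it may discard information. For left and inner joins, as in the question, that is not a concern.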