Get rid of duplicate rows on conditions for a pandas dataframe
Question:
I have a dataframe with lots of duplicated rows on index, like this:
import pandas as pd

olddf = pd.DataFrame(index=['MJ','MJ','MJ','BJ','KJ','KJ'],data={'name':['masdjsdf','machael jordon','mskkkadke','boris johnson', 'kim jongun', 'kkasdfl'],'age':[23,40,31,35,25,30]})
I need to get rid of the duplicated index rows whose name does not match the dictionary
dic = {'MJ':'machael jordon', 'BJ':'boris johnson', 'KJ':'kim jongun'}.
So after the operation, the dataframe should become
newdf = pd.DataFrame(index=['MJ','BJ','KJ'],data={'name':['machael jordon','boris johnson', 'kim jongun',],'age':[40,35,25]})
Thank you…
Answers:
Use map
to look up each index label in the dictionary, then eq
to compare the result against the name column. This yields True where the two are equal, and the resulting boolean Series can be used as a mask to slice the original dataframe:
mask = olddf['name'].eq(olddf.index.map(dic))
newdf = olddf[mask]
Output:
name age
MJ machael jordon 40
BJ boris johnson 35
KJ kim jongun 25
To also keep the rows whose index is not duplicated, add a second mask:
mask2 = ~olddf.index.duplicated(keep=False)
newdf = olddf[mask|mask2]
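To see both masks in action end to end, here is a self-contained sketch. It uses the data from the question plus one hypothetical extra row ('EM', not in the original) with a unique, non-dictionary index, to show that mask2 preserves such rows:

```python
import pandas as pd

dic = {'MJ': 'machael jordon', 'BJ': 'boris johnson', 'KJ': 'kim jongun'}

# Same data as the question, plus a hypothetical 'EM' row (a unique index
# label that is absent from the dictionary) to illustrate mask2.
olddf = pd.DataFrame(
    index=['MJ', 'MJ', 'MJ', 'BJ', 'KJ', 'KJ', 'EM'],
    data={'name': ['masdjsdf', 'machael jordon', 'mskkkadke',
                   'boris johnson', 'kim jongun', 'kkasdfl', 'elon musk'],
          'age': [23, 40, 31, 35, 25, 30, 50]})

mask = olddf['name'].eq(olddf.index.map(dic))   # rows whose name matches the dict
mask2 = ~olddf.index.duplicated(keep=False)     # rows whose index occurs only once
newdf = olddf[mask | mask2]
print(newdf)
```

The 'EM' row fails mask (its index maps to NaN in the dictionary) but passes mask2, so it survives the slice alongside the three dictionary matches.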
Alternatively, make the dict a pd.Series and check membership with .isin():
lst={'MJ':'machael jordon', 'BJ':'boris johnson', 'KJ':'kim jongun'}
olddf[olddf.isin(pd.Series(lst, name='name')).any(axis=1)]
Alternatively, build a dataframe from the dict, append the name column to the index on both the new and old dataframes, and then merge on the index; either a left or an inner merge works:
(pd.DataFrame(pd.Series(lst, name='name'))
   .set_index('name', append=True)
   .merge(olddf.set_index('name', append=True),
          how='left', left_index=True, right_index=True))
Outcome
name age
MJ machael jordon 40
BJ boris johnson 35
KJ kim jongun 25
dic = {'MJ':'machael jordon', 'BJ':'boris johnson', 'KJ':'kim jongun'}
olddf.join(pd.Series(dic).to_frame('name'),rsuffix='_2').query("name==name_2")
name age name_2
BJ boris johnson 35 boris johnson
KJ kim jongun 25 kim jongun
MJ machael jordon 40 machael jordon
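The join result above still carries the helper column name_2. A small sketch, assuming the setup from the question, that drops it to match the desired newdf shape:

```python
import pandas as pd

dic = {'MJ': 'machael jordon', 'BJ': 'boris johnson', 'KJ': 'kim jongun'}
olddf = pd.DataFrame(
    index=['MJ', 'MJ', 'MJ', 'BJ', 'KJ', 'KJ'],
    data={'name': ['masdjsdf', 'machael jordon', 'mskkkadke',
                   'boris johnson', 'kim jongun', 'kkasdfl'],
          'age': [23, 40, 31, 35, 25, 30]})

# Join the dictionary as a second name column, keep the matching rows,
# then drop the helper column so only name and age remain.
newdf = (olddf.join(pd.Series(dic).to_frame('name'), rsuffix='_2')
              .query("name == name_2")
              .drop(columns='name_2'))
print(newdf)
```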