Get rid of duplicate rows on conditions for a pandas dataframe
Question:
I have a dataframe with lots of duplicated rows on index, like this:
import pandas as pd

olddf = pd.DataFrame(index=['MJ','MJ','MJ','BJ','KJ','KJ'],data={'name':['masdjsdf','machael jordon','mskkkadke','boris johnson', 'kim jongun', 'kkasdfl'],'age':[23,40,31,35,25,30]})
I need to get rid of the duplicated index rows whose name does not match the dictionary
dic = {'MJ':'machael jordon', 'BJ':'boris johnson', 'KJ':'kim jongun'}.
So after the operation, the dataframe should become
newdf = pd.DataFrame(index=['MJ','BJ','KJ'],data={'name':['machael jordon','boris johnson', 'kim jongun',],'age':[40,35,25]})
Thank you…
Answers:
Use map
to look up each index label in the dictionary, then eq
to compare the result against the name column. This yields True where the two are equal, and the resulting boolean Series can be used as a mask to slice the original dataframe:
mask = olddf['name'].eq(olddf.index.map(dic))
newdf = olddf[mask]
Output:
name age
MJ machael jordon 40
BJ boris johnson 35
KJ kim jongun 25
To also keep the rows whose index is not duplicated, add a second mask:
mask2 = ~olddf.index.duplicated(keep=False)
newdf = olddf[mask|mask2]
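To see both masks in action end to end, here is a self-contained sketch. It uses the data from the question plus one hypothetical extra row ('EM', not in the original) with a unique, non-dictionary index, to show that mask2 preserves such rows:

```python
import pandas as pd

dic = {'MJ': 'machael jordon', 'BJ': 'boris johnson', 'KJ': 'kim jongun'}

# Same data as the question, plus a hypothetical 'EM' row (a unique index
# label that is absent from the dictionary) to illustrate mask2.
olddf = pd.DataFrame(
    index=['MJ', 'MJ', 'MJ', 'BJ', 'KJ', 'KJ', 'EM'],
    data={'name': ['masdjsdf', 'machael jordon', 'mskkkadke',
                   'boris johnson', 'kim jongun', 'kkasdfl', 'elon musk'],
          'age': [23, 40, 31, 35, 25, 30, 50]})

mask = olddf['name'].eq(olddf.index.map(dic))   # rows whose name matches the dict
mask2 = ~olddf.index.duplicated(keep=False)     # rows whose index occurs only once
newdf = olddf[mask | mask2]
print(newdf)
```

The 'EM' row fails mask (its index maps to NaN in the dictionary) but passes mask2, so it survives the slice alongside the three dictionary matches.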
Alternatively, make the dict a pd.Series and check membership with .isin():
lst={'MJ':'machael jordon', 'BJ':'boris johnson', 'KJ':'kim jongun'}
olddf[olddf.isin(pd.Series(lst, name='name')).any(axis=1)]
Alternatively, build a dataframe from the dict, append the name column to the index on both the new and old dataframes, and then merge on the index; either a left or an inner merge works:
(pd.DataFrame(pd.Series(lst, name='name'))
   .set_index('name', append=True)
   .merge(olddf.set_index('name', append=True),
          how='left', left_index=True, right_index=True))
Outcome
name age
MJ machael jordon 40
BJ boris johnson 35
KJ kim jongun 25
dic = {'MJ':'machael jordon', 'BJ':'boris johnson', 'KJ':'kim jongun'}
olddf.join(pd.Series(dic).to_frame('name'),rsuffix='_2').query("name==name_2")
name age name_2
BJ boris johnson 35 boris johnson
KJ kim jongun 25 kim jongun
MJ machael jordon 40 machael jordon
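The join result above still carries the helper column name_2. A small sketch, assuming the setup from the question, that drops it to match the desired newdf shape:

```python
import pandas as pd

dic = {'MJ': 'machael jordon', 'BJ': 'boris johnson', 'KJ': 'kim jongun'}
olddf = pd.DataFrame(
    index=['MJ', 'MJ', 'MJ', 'BJ', 'KJ', 'KJ'],
    data={'name': ['masdjsdf', 'machael jordon', 'mskkkadke',
                   'boris johnson', 'kim jongun', 'kkasdfl'],
          'age': [23, 40, 31, 35, 25, 30]})

# Join the dictionary as a second name column, keep the matching rows,
# then drop the helper column so only name and age remain.
newdf = (olddf.join(pd.Series(dic).to_frame('name'), rsuffix='_2')
              .query("name == name_2")
              .drop(columns='name_2'))
print(newdf)
```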