Deleting duplicate rows with pandas in Python deletes only once

Question:

I have a CSV file from which I need to remove duplicate rows based on 3 columns. I tried the code below, but it removes only one duplicate, not all of the possible duplicates.

ins.csv:

sr,instrument_token,exchange_token,tradingsymbol,name
4,367376,2112,nf50,nf50
9,361216,2127,nfbf,nfbf
4,367376,2112,nf50,nf50
9,361216,2127,nfbf,nfbf
4,367376,2112,nf50,nf50
9,361216,2127,nfbf,nfbf
4,367376,2112,nf50,nf50
9,361216,2127,nfbf,nfbf
4,367376,2112,nf50,nf50
9,361216,2127,nfbf,nfbf
4,367376,2112,nf50,nf50
9,361216,2127,nfbf,nfbf

python:

import pandas as pd
import numpy as np

ins = pd.read_csv('ins.csv')

new_ins = ins[pd.DataFrame(np.sort(ins[['instrument_token','exchange_token','tradingsymbol']].values.astype(str),1)).duplicated()]

new_ins.to_csv('ins.csv', mode='w', header=new_ins.columns.tolist(), index=False)
Asked By: koceca


Answers:

import pandas as pd
import numpy as np

ins = pd.read_csv('ins.csv')
new_ins = ins.drop_duplicates(['instrument_token','exchange_token','tradingsymbol'], keep='first')

keep='first' is actually the default value, so there is no need to add it. It means that only the first occurrence of each duplicate group is kept.

If you want to drop every duplicated row instead, keeping none of them, pass keep=False, as in the sketch below.
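For reference, a minimal sketch of the full workflow (the column names come from the question's ins.csv; writing back to the same file mirrors the question's code):

import pandas as pd

ins = pd.read_csv('ins.csv')

# keep='first' (the default): one row survives per duplicate group
deduped = ins.drop_duplicates(['instrument_token', 'exchange_token', 'tradingsymbol'])

# keep=False: every row that has any duplicate is dropped entirely
# (note: on the sample data this would remove all rows, since each row repeats)
strict = ins.drop_duplicates(['instrument_token', 'exchange_token', 'tradingsymbol'], keep=False)

# write the deduplicated rows back; index=False avoids adding an index column
deduped.to_csv('ins.csv', index=False)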

Answered By: Costas

An alternative to drop_duplicates is to use groupby.nunique.

df.groupby(['sr', 'instrument_token', 'exchange_token', 'tradingsymbol', 'name']).nunique().reset_index()
Out[24]: 
   sr  instrument_token  exchange_token tradingsymbol  name
0   4            367376            2112          nf50  nf50
1   9            361216            2127          nfbf  nfbf
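If you only need to deduplicate on the three key columns rather than all five, a groupby-based variant along the same lines might look like this (using as_index=False with .first() is an assumption about the desired behavior, keeping the first row of each group):

import pandas as pd

df = pd.read_csv('ins.csv')

# group on the three key columns and take the first row of each group;
# equivalent to drop_duplicates(keep='first') except that groupby sorts
# the groups by key, so the row order may differ
result = df.groupby(['instrument_token', 'exchange_token', 'tradingsymbol'], as_index=False).first()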