How to create a new variable that tells us if the value in a given column is unique or not?

Question:

I’m working on the titanic dataset and figured that determining if a ticket is unique or not might be predictive. I’ve written some code below that tells us if the value is unique or not but I have a feeling that there is a much cleaner way of doing this.

 import pandas as pd
dummy_data = pd.DataFrame({"Ticket" : ['A','A','B','B','C','D']
                     })

dummy_data
counts = dummy_data['Ticket'].value_counts()
a = counts[counts>1] 
b=a.reset_index(name='Quantity')
b = b.rename(columns={'index': 'Tickets'})
b
dummy_data['Ticket_multiple'] = np.where(dummy_data['Ticket'].isin(b['Tickets']),1,0)
dummy_data.head(10)
Asked By: Mobix

||

Answers:

You can use pandas.Series.duplicated with keep=False:

Parameters: keep{‘first’, ‘last’, False}, default ‘first’ :
Method to handle dropping duplicates:

  • ‘first’ : Mark duplicates as True except for the first occurrence.
  • ‘last’ : Mark duplicates as True except for the last occurrence.
  • False : Mark all duplicates as True.
dummy_data['Ticket_multiple'] = dummy_data["Ticket"].duplicated(keep=False).astype(int)


Output :

print(dummy_data)

  Ticket  Ticket_multiple
0      A                1
1      A                1
2      B                1
3      B                1
4      C                0
5      D                0

Update :

If you need to consider a value duplicated only if it occurs (>N) times, use this :

#------------------------------------------------ added this value
#                                               ˅
dummy_data = pd.DataFrame({"Ticket" : ['A','A','A','B','B','C','D']})
​
N = 2 #threshold

m = dummy_data.groupby("Ticket").transform("size").gt(N)

dummy_data["Ticket_multiple (threshold)"] = m.astype(int)

print(dummy_data)
  Ticket  Ticket_multiple  Ticket_multiple (threshold)
0      A                1                            1
1      A                1                            1
2      A                1                            1
3      B                1                            0
4      B                1                            0
5      C                0                            0
6      D                0                            0
Answered By: Timeless
Categories: questions Tags: ,
Answers are sorted by their score. The answer accepted by the question owner as the best is marked with
at the top-right corner.