Pandas: Remove consecutive duplicates grouped by unique identifier
Question:
Current dataframe is as follows:
df = pd.read_csv('filename.csv', delimiter=',')
print(df)
idx  uniqueID    String
0    1           'hello'
1    1           'goodbye'
2    1           'goodbye'
3    1           'happy'
4    2           'hello'
5    2           'hello'
6    2           'goodbye'
7    3           'goodbye'
8    3           'hello'
9    3           'hello'
10   4           'hello'
11   5           'goodbye'
Expected Output:
idx  uniqueID    String
0    1           'hello'
1    1           'goodbye'
3    1           'happy'
4    2           'hello'
6    2           'goodbye'
7    3           'goodbye'
8    3           'hello'
10   4           'hello'
11   5           'goodbye'
Question: How do I remove the consecutive duplicates only of the same uniqueID?
What I’ve tried to do thus far:
df = df[(df['String'].shift() != df['String']) | (df['uniqueID'] != df['uniqueID'])]
I'm not sure what condition I need to include to ensure it compares rows only within the same uniqueID. Any and all suggestions are appreciated. Thanks!
Answers:
Use shift and compare both the uniqueID and String of the current row against the previous row:
df = df[~(
    (df['uniqueID'] == df['uniqueID'].shift(1)) &
    (df['String'] == df['String'].shift(1))
)]
    idx  uniqueID    String
0   0    1           'hello'
1   1    1           'goodbye'
3   3    1           'happy'
4   4    2           'hello'
6   6    2           'goodbye'
7   7    3           'goodbye'
8   8    3           'hello'
10  10   4           'hello'
11  11   5           'goodbye'
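A groupby-based variant sidesteps the explicit uniqueID comparison entirely: shifting within each group means the first row of every group is compared against NaN and is therefore always kept. A minimal runnable sketch, with the sample data reconstructed from the question:

```python
import pandas as pd

# Sample data reconstructed from the question
df = pd.DataFrame({
    'uniqueID': [1, 1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 5],
    'String': ['hello', 'goodbye', 'goodbye', 'happy', 'hello', 'hello',
               'goodbye', 'goodbye', 'hello', 'hello', 'hello', 'goodbye'],
})

# groupby().shift() compares each row only with the previous row of the
# same uniqueID, so no separate ID comparison (or pre-sorting) is needed
out = df[df['String'] != df.groupby('uniqueID')['String'].shift()]
print(out)
```

This keeps rows 0, 1, 3, 4, 6, 7, 8, 10, and 11, matching the expected output even if rows with the same uniqueID were scattered throughout the frame.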
You forgot to shift the uniqueID column, so your second condition compares each value with itself and is always False. Also note that shifting only works correctly if rows with the same uniqueID are consecutive; you can sort by the uniqueID column first to ensure that.
out = (df.sort_values('uniqueID')
         [lambda df: (df['String'].shift() != df['String'])
                     | (df['uniqueID'].shift() != df['uniqueID'])]
       # .sort_index()  # sort_values changes the row order; uncomment to restore the original order
      )
print(out)
    idx  uniqueID    String
0   0    1           'hello'
1   1    1           'goodbye'
3   3    1           'happy'
4   4    2           'hello'
6   6    2           'goodbye'
7   7    3           'goodbye'
8   8    3           'hello'
10  10   4           'hello'
11  11   5           'goodbye'
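The sort-then-shift approach above can be sketched end to end as follows (sample data reconstructed from the question; `kind='stable'` is an added assumption to preserve the original relative order within each uniqueID):

```python
import pandas as pd

# Sample data reconstructed from the question
df = pd.DataFrame({
    'uniqueID': [1, 1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 5],
    'String': ['hello', 'goodbye', 'goodbye', 'happy', 'hello', 'hello',
               'goodbye', 'goodbye', 'hello', 'hello', 'hello', 'goodbye'],
})

out = (
    df.sort_values('uniqueID', kind='stable')     # make equal IDs consecutive
      [lambda d: (d['String'] != d['String'].shift())
                 | (d['uniqueID'] != d['uniqueID'].shift())]
      .sort_index()                               # restore the original order
)
print(out)
```

Keeping a row whenever either the String *or* the uniqueID differs from the previous row is exactly the complement of dropping rows where both match, so this gives the same result as the first answer, but without relying on the input already being grouped by uniqueID.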