Pandas, Remove consecutive duplicates grouped by unique identifier

Question:

Current dataframe is as follows:

df = pd.read_csv('filename.csv', delimiter=',')

print(df)
idx   uniqueID      String 

0        1           'hello'
1        1           'goodbye'
2        1           'goodbye'
3        1           'happy'
4        2           'hello'
5        2           'hello'
6        2           'goodbye'
7        3           'goodbye' 
8        3           'hello'
9        3           'hello'
10       4           'hello'
11       5           'goodbye'

Expected Output:

idx   uniqueID      String 

0        1           'hello'
1        1           'goodbye'
3        1           'happy'
4        2           'hello'
6        2           'goodbye'
7        3           'goodbye' 
8        3           'hello'
10       4           'hello'
11       5           'goodbye'

Question: How do I remove the consecutive duplicates only of the same uniqueID?

What I’ve tried to do thus far:

df = df[(df['String '].shift() != df['String ']) | (df['uniqueID'] != df['uniqueID'])]

I’m not sure what case I need to include to ensure it looks specifically at the uniqueID. Any and all suggestions are appreciated. Thanks

Asked By: Ev0

||

Answers:

use shift and compare the uniqueid and string between current and previous.

df= df[~( 
    (df['uniqueID']==df['uniqueID'].shift(1)) & 
    (df['String'].eq(df['String'].shift(1)))
)]
    idx     uniqueID    String
0   0   1   'hello'
1   1   1   'goodbye'
3   3   1   'happy'
4   4   2   'hello'
6   6   2   'goodbye'
7   7   3   'goodbye'
8   8   3   'hello'
10  10  4   'hello'
11  11  5   'goodbye'
Answered By: Naveed

You forget to shift the uniqueID column to ensure same uniqueID. But shifting only works if uniqueID is consecutive, you can sort by uniqueID column first to ensure it is consecutive.

out = (df.sort_values('uniqueID')
       [lambda df: (df['String'].shift() != df['String']) | (df['uniqueID'].shift() != df['uniqueID'])]
       #.sort_index() # since sort values changes index order, you can optionally convert back to original with sort_index
      )
print(out)

    idx  uniqueID     String
0     0         1    'hello'
1     1         1  'goodbye'
3     3         1    'happy'
4     4         2    'hello'
6     6         2  'goodbye'
7     7         3  'goodbye'
8     8         3    'hello'
10   10         4    'hello'
11   11         5  'goodbye'
Answered By: Ynjxsjmh
Categories: questions Tags: ,
Answers are sorted by their score. The answer accepted by the question owner as the best is marked with
at the top-right corner.