Pandas: Remove consecutive duplicates grouped by unique identifier
Question:
Current dataframe is as follows:
df = pd.read_csv('filename.csv', delimiter=',')
print(df)
idx  uniqueID    String
0    1           'hello'
1    1           'goodbye'
2    1           'goodbye'
3    1           'happy'
4    2           'hello'
5    2           'hello'
6    2           'goodbye'
7    3           'goodbye'
8    3           'hello'
9    3           'hello'
10   4           'hello'
11   5           'goodbye'
Expected Output:
idx  uniqueID    String
0    1           'hello'
1    1           'goodbye'
3    1           'happy'
4    2           'hello'
6    2           'goodbye'
7    3           'goodbye'
8    3           'hello'
10   4           'hello'
11   5           'goodbye'
Question: How do I remove the consecutive duplicates only of the same uniqueID?
What I’ve tried to do thus far:
df = df[(df['String'].shift() != df['String']) | (df['uniqueID'] != df['uniqueID'])]
I'm not sure what condition I need to include to ensure it compares rows only within the same uniqueID. Any and all suggestions are appreciated. Thanks!
Answers:
Use shift and compare both the uniqueID and String of the current row against the previous row:
df = df[~(
    (df['uniqueID'] == df['uniqueID'].shift(1)) &
    (df['String'] == df['String'].shift(1))
)]
    idx  uniqueID    String
0   0    1           'hello'
1   1    1           'goodbye'
3   3    1           'happy'
4   4    2           'hello'
6   6    2           'goodbye'
7   7    3           'goodbye'
8   8    3           'hello'
10  10   4           'hello'
11  11   5           'goodbye'
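A groupby-based variant sidesteps the explicit uniqueID comparison entirely: shifting within each group means the first row of every group is compared against NaN and is therefore always kept. A minimal runnable sketch, with the sample data reconstructed from the question:

```python
import pandas as pd

# Sample data reconstructed from the question
df = pd.DataFrame({
    'uniqueID': [1, 1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 5],
    'String': ['hello', 'goodbye', 'goodbye', 'happy', 'hello', 'hello',
               'goodbye', 'goodbye', 'hello', 'hello', 'hello', 'goodbye'],
})

# groupby().shift() compares each row only with the previous row of the
# same uniqueID, so no separate ID comparison (or pre-sorting) is needed
out = df[df['String'] != df.groupby('uniqueID')['String'].shift()]
print(out)
```

This keeps rows 0, 1, 3, 4, 6, 7, 8, 10, and 11, matching the expected output even if rows with the same uniqueID were scattered throughout the frame.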
You forgot to shift the uniqueID column, so your second condition compares each value with itself and is always False. Also note that shifting only works correctly if rows with the same uniqueID are consecutive; you can sort by the uniqueID column first to ensure that.
out = (df.sort_values('uniqueID')
         [lambda df: (df['String'].shift() != df['String'])
                     | (df['uniqueID'].shift() != df['uniqueID'])]
       # .sort_index()  # sort_values changes the row order; uncomment to restore the original order
      )
print(out)
    idx  uniqueID    String
0   0    1           'hello'
1   1    1           'goodbye'
3   3    1           'happy'
4   4    2           'hello'
6   6    2           'goodbye'
7   7    3           'goodbye'
8   8    3           'hello'
10  10   4           'hello'
11  11   5           'goodbye'
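The sort-then-shift approach above can be sketched end to end as follows (sample data reconstructed from the question; `kind='stable'` is an added assumption to preserve the original relative order within each uniqueID):

```python
import pandas as pd

# Sample data reconstructed from the question
df = pd.DataFrame({
    'uniqueID': [1, 1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 5],
    'String': ['hello', 'goodbye', 'goodbye', 'happy', 'hello', 'hello',
               'goodbye', 'goodbye', 'hello', 'hello', 'hello', 'goodbye'],
})

out = (
    df.sort_values('uniqueID', kind='stable')     # make equal IDs consecutive
      [lambda d: (d['String'] != d['String'].shift())
                 | (d['uniqueID'] != d['uniqueID'].shift())]
      .sort_index()                               # restore the original order
)
print(out)
```

Keeping a row whenever either the String *or* the uniqueID differs from the previous row is exactly the complement of dropping rows where both match, so this gives the same result as the first answer, but without relying on the input already being grouped by uniqueID.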