Identify Duplicated rows with Additional Column
Question:
I have the following Dataframe:
   PplNum  RoomNum  Value
0       1        0    265
1       1       12    170
2       2        0    297
3       2       12     85
4       2        0     41
5       2       12    144
Generally, PplNum and RoomNum are generated like this, and they will always follow this format:
for ppl in [1,2,2]:
    for room in [0, 12]:
        print(ppl, room)
1 0
1 12
2 0
2 12
2 0
2 12
What I would like to do now is mark the duplicate combinations of PplNum and RoomNum, so that I know which combinations are the first occurrence, which are the second occurrence, and so on. The expected output Dataframe looks like this:
   PplNum  RoomNum  Value  C
0       1        0    265  1
1       1       12    170  1
2       2        0    297  1
3       2       12     85  1
4       2        0     41  2
5       2       12    144  2
Answers:
You can do it using groupby() together with the cumcount() function:
In [102]: df['C'] = df.groupby(['PplNum','RoomNum']).cumcount() + 1

In [103]: df
Out[103]:
   PplNum  RoomNum  Value  C
0       1        0    265  1
1       1       12    170  1
2       2        0    297  1
3       2       12     85  1
4       2        0     41  2
5       2       12    144  2
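Explanation: cumcount() numbers the rows within each (PplNum, RoomNum) group in order of appearance, starting from 0, so adding 1 turns that into an occurrence number:
In [101]: df.groupby(['PplNum','RoomNum']).cumcount() + 1
Out[101]:
0    1
1    1
2    1
3    1
4    2
5    2
dtype: int64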
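For anyone who wants to reproduce this outside an IPython session, here is a minimal self-contained sketch. The DataFrame is rebuilt from the values in the question; the intermediate names grouped and zero_based are only there for illustration.

```python
import pandas as pd

# Rebuild the DataFrame from the question
df = pd.DataFrame({
    "PplNum":  [1, 1, 2, 2, 2, 2],
    "RoomNum": [0, 12, 0, 12, 0, 12],
    "Value":   [265, 170, 297, 85, 41, 144],
})

# Group rows that share the same (PplNum, RoomNum) pair
grouped = df.groupby(["PplNum", "RoomNum"])

# cumcount() is 0-based: the first row of each group gets 0, the next 1, ...
zero_based = grouped.cumcount()

# Shift to 1-based so the first occurrence is marked with 1
df["C"] = zero_based + 1
print(df)
```

The result is aligned on the original index, so the row order of df does not change.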
Here is my approach with a recursive function:
import pandas as pd

def rename_dup(df):
    # Inner helper renamed to _rename_dup so it does not shadow the outer function
    def _rename_dup(df, c, dfnew):
        # First occurrence of every (PplNum, RoomNum) pair still left in df
        dfnondup = df.drop_duplicates(['PplNum', 'RoomNum']).copy()
        dfnondup['C'] = c
        dfnew = pd.concat([dfnew, dfnondup], axis=0)
        c += 1
        # Rows repeating an already-seen pair are handled in the next call
        dfdup = df[df.duplicated(['PplNum', 'RoomNum'])]
        if dfdup.empty:
            return dfnew, c
        else:
            return _rename_dup(dfdup, c, dfnew)
    return _rename_dup(df, 1, pd.DataFrame())

dfnew, c = rename_dup(df)
The result dfnew will be:
dfnew
Out[28]:
   PplNum  RoomNum  Value  C
0       1        0    265  1
1       1       12    170  1
2       2        0    297  1
3       2       12     85  1
4       2        0     41  2
5       2       12    144  2
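Each call of the recursive function peels off one layer of first occurrences, so the same idea could probably also be written as a loop. Below is a minimal sketch of that, assuming the DataFrame has a unique index; mark_occurrences is a hypothetical helper name, not something pandas provides.

```python
import pandas as pd

def mark_occurrences(df, keys):
    """Hypothetical helper: iterative version of the peel-off idea.

    Assumes df has a unique index.
    """
    counts = pd.Series(0, index=df.index, dtype="int64")
    remaining = df
    c = 1
    while not remaining.empty:
        # True for rows repeating a key combination seen earlier in `remaining`
        is_dup = remaining.duplicated(keys).to_numpy()
        # Rows that are first occurrences in this pass get the current number
        counts.loc[remaining.index[~is_dup]] = c
        # Keep only the repeats for the next pass
        remaining = remaining[is_dup]
        c += 1
    # Attach the numbers to a copy of df, aligned on the index
    return df.assign(C=counts)

df = pd.DataFrame({
    "PplNum":  [1, 1, 2, 2, 2, 2],
    "RoomNum": [0, 12, 0, 12, 0, 12],
    "Value":   [265, 170, 297, 85, 41, 144],
})
out = mark_occurrences(df, ["PplNum", "RoomNum"])
print(out)
```

Unlike the concat-based version, this keeps the rows in their original order even when duplicates are interleaved with first occurrences, and it should give the same numbers as groupby(...).cumcount() + 1.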