How to keep a cumulative count of changes across rows, ignoring NaNs, and create a separate column with the results
Question:
I have a data frame that looks like this:
Identification | Date (day/month/year) | X | Y |
---|---|---|---|
123 | 01/01/2022 | NaN | abc |
123 | 02/01/2022 | 200 | acb |
123 | 03/01/2022 | 200 | ary |
124 | 01/01/2022 | 200 | abc |
124 | 02/01/2022 | NaN | abc |
124 | 03/01/2022 | NaN | NaN |
I am trying to create two separate ‘change’ columns, one each for X and Y, that keep a rolling count of how many times a given element changes over time. I would like my output to look something like this, where NaN → NaN is not counted as a change but NaN → some element is:
Identification | Date (day/month/year) | X | Y | Change X | Change Y |
---|---|---|---|---|---|
123 | 01/01/2022 | NaN | abc | 0 | 0 |
123 | 02/01/2022 | 200 | acb | 1 | 1 |
123 | 03/01/2022 | 200 | ary | 1 | 2 |
124 | 01/01/2022 | 200 | abc | 0 | 0 |
124 | 02/01/2022 | NaN | abc | 1 | 0 |
124 | 03/01/2022 | NaN | NaN | 1 | 1 |
Thanks 🙂
Answers:
You can use a classic comparison of each value with the previous one in its group (obtained with `groupby.shift`) combined with a `groupby.cumsum`. However, a NaN compared for equality with another NaN yields False, so `ne` would count two consecutive NaNs as a change. To overcome this, we can first `fillna` with a sentinel that is not part of the dataset. Here I chose the `object` built-in, which can never occur as a data value; it could be `-1` if your data is strictly positive.
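The code below assumes a frame named `df`; for reference, the question's sample data can be reconstructed like this (a sketch, since the asker's real data source is unknown):

```python
import numpy as np
import pandas as pd

# Sample frame matching the question's table
df = pd.DataFrame({
    'Identification': [123, 123, 123, 124, 124, 124],
    'Date (day/month/year)': ['01/01/2022', '02/01/2022', '03/01/2022'] * 2,
    'X': [np.nan, 200, 200, 200, np.nan, np.nan],
    'Y': ['abc', 'acb', 'ary', 'abc', 'abc', np.nan],
})
```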
```python
def change(s):
    # Replace NaNs with a sentinel (the `object` class itself) that is
    # guaranteed not to occur in the data, so NaN -> NaN compares equal
    s = s.fillna(object)
    # Flag rows that differ from the previous row within each group,
    # then count the flags cumulatively; sub(1) removes the flag that
    # the first row of each group always raises
    return (s.ne(s.groupby(df['Identification']).shift())
             .groupby(df['Identification']).cumsum().sub(1))

out = df.join(df[['X', 'Y']].apply(change).add_prefix('Change '))
print(out)
```
Output:
```
  Identification Date (day/month/year)      X    Y  Change X  Change Y
0            123            01/01/2022    NaN  abc         0         0
1            123            02/01/2022  200.0  acb         1         1
2            123            03/01/2022  200.0  ary         1         2
3            124            01/01/2022  200.0  abc         0         0
4            124            02/01/2022    NaN  abc         1         0
5            124            03/01/2022    NaN  NaN         1         1
```
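As a sketch of an alternative that avoids the fill sentinel altogether, the comparison can be masked so that two consecutive NaNs never register as a change (the helper name `change_masked` is illustrative, not from the original answer):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Identification': [123, 123, 123, 124, 124, 124],
    'X': [np.nan, 200, 200, 200, np.nan, np.nan],
    'Y': ['abc', 'acb', 'ary', 'abc', 'abc', np.nan],
})

def change_masked(s, groups):
    prev = s.groupby(groups).shift()
    # The first row of each group is a baseline, never a change
    first = groups.groupby(groups).cumcount().eq(0)
    # A change is any value difference, except NaN -> NaN
    changed = s.ne(prev) & ~(s.isna() & prev.isna()) & ~first
    return changed.astype(int).groupby(groups).cumsum()

out = df.join(
    df[['X', 'Y']].apply(change_masked, groups=df['Identification'])
                  .add_prefix('Change ')
)
```

This keeps the original dtypes intact (no object sentinel ever enters the column), at the cost of building the mask explicitly.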