Assign a unique value to consecutive null values until a non-null value
Question:
I want to compute a cumulative count of null values in which each run of consecutive nulls increments the count only once.
The closest I have come is this:
import pandas as pd
import numpy as np
# create the column
col = pd.Series([1, 2, np.nan, np.nan, 3, 4, np.nan, np.nan, 5])
col.isnull().cumsum()
But the output is not what I want:
0 0
1 0
2 1
3 2
4 2
5 2
6 3
7 4
8 4
dtype: int32
I want the output to be the following: [0, 0, 1, 1, 1, 1, 2, 2, 2].
How do I achieve this?
Answers:
You seem to want to count only the first NA per stretch:
m = col.isna()
out = (m & ~m.shift(fill_value=False)).cumsum()
Shortcut:
m = col.isna()
out = (m & m.diff()).cumsum()
Output:
0 0
1 0
2 1
3 1
4 1
5 1
6 2
7 2
8 2
dtype: int64
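As a self-contained check, the first expression can be wrapped in a small helper and compared with the output requested in the question. The name `label_null_runs` is made up for this sketch and is not a pandas function.

```python
import numpy as np
import pandas as pd


def label_null_runs(s: pd.Series) -> pd.Series:
    """Cumulative count of null runs (hypothetical helper, not part of pandas)."""
    m = s.isna()
    # True only on the first null of each run: null now, not null on the previous row.
    starts = m & ~m.shift(fill_value=False)
    return starts.cumsum()


col = pd.Series([1, 2, np.nan, np.nan, 3, 4, np.nan, np.nan, 5])
labels = label_null_runs(col)
print(labels.tolist())  # [0, 0, 1, 1, 1, 1, 2, 2, 2]
```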
Intermediates:
   col      m  ~m.shift(fill_value=False)      &  cumsum
0  1.0  False                        True  False       0
1  2.0  False                        True  False       0
2  NaN   True                        True   True       1
3  NaN   True                       False  False       1
4  3.0  False                       False  False       1
5  4.0  False                        True  False       1
6  NaN   True                        True   True       2
7  NaN   True                       False  False       2
8  5.0  False                       False  False       2
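The intermediate columns can be rebuilt step by step with a sketch like the following. The variable names `not_prev`, `start` and `steps` are chosen here for illustration only.

```python
import numpy as np
import pandas as pd

col = pd.Series([1, 2, np.nan, np.nan, 3, 4, np.nan, np.nan, 5])

m = col.isna()                         # True where the value is null
not_prev = ~m.shift(fill_value=False)  # True where the previous row is not null
start = m & not_prev                   # True on the first null of each run

# Collect every step side by side, using the same column labels as the table.
steps = pd.DataFrame({
    "col": col,
    "m": m,
    "~m.shift(fill_value=False)": not_prev,
    "&": start,
    "cumsum": start.cumsum(),
})
print(steps)
```

Only rows 2 and 6 are True in the `&` column, so the cumulative sum increases exactly twice.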
Variant:
out = col.isna().astype(int).diff().eq(1).cumsum()
You can use:
# Increment when the previous row is not n/a AND the current row is n/a
out = (col.shift().notna() & col.isna()).cumsum()
print(out)
# Output
0 0
1 0
2 1
3 1
4 1
5 1
6 2
7 2
8 2
dtype: int64
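One difference between the approaches on this page shows up only when the Series starts with a null, which the sample data does not. A minimal sketch, assuming a made-up Series `lead` that begins with NaN: the `shift(fill_value=False)` version counts the leading run as run 1, while the `diff().eq(1)` variant and the `col.shift().notna()` version leave it at 0, because there is no previous non-null row to compare against.

```python
import numpy as np
import pandas as pd

# Hypothetical input that starts with a null run (not from the question).
lead = pd.Series([np.nan, np.nan, 1, np.nan, 2])

m = lead.isna()
a = (m & ~m.shift(fill_value=False)).cumsum()        # leading run counted
b = lead.isna().astype(int).diff().eq(1).cumsum()    # leading run not counted
c = (lead.shift().notna() & lead.isna()).cumsum()    # leading run not counted

print(a.tolist())  # [1, 1, 1, 2, 2]
print(b.tolist())  # [0, 0, 0, 1, 1]
print(c.tolist())  # [0, 0, 0, 1, 1]
```

Which behavior is right depends on whether a leading null run should get its own label. On the question's data all three expressions give the same result.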