How to drop last continuous filled element in pandas

Question:

I wish to set the last entry of every consecutive run of filled (non-null) values in a pandas column to None.

Example: for the DataFrame below:

import pandas as pd

df = pd.DataFrame({
    0: ['1/24/2022', '1/25/2022', '1/26/2022', '1/27/2022', '1/28/2022', '1/29/2022', '1/30/2022', '1/31/2022', '2/1/2022', '2/2/2022', '2/3/2022', '2/4/2022', '2/5/2022', '2/6/2022', '2/7/2022', '2/8/2022', '2/9/2022'],
    1: [None, None, 'AB', 'C', 'D', 'Epiphany', None, None, None, None, None, 'A', 'A', 'A', 'B', 'B', None]
})

# My attempt: this only finds the overall last non-null row,
# not the last entry of each consecutive filled run
last_non_empty_row = df.last_valid_index()
last_non_empty_cell = df.loc[last_non_empty_row]

I would like to convert 'Epiphany' to None and the 'B' for '2/8/2022' to None, i.e. the last value of each consecutive filled run.

Expected output:

df_expected = pd.DataFrame({
    0: ['1/24/2022', '1/25/2022', '1/26/2022', '1/27/2022', '1/28/2022', '1/29/2022', '1/30/2022', '1/31/2022', '2/1/2022', '2/2/2022', '2/3/2022', '2/4/2022', '2/5/2022', '2/6/2022', '2/7/2022', '2/8/2022', '2/9/2022'],
    1: [None, None, 'AB', 'C', 'D', None, None, None, None, None, None, 'A', 'A', 'A', 'B', None, None]
})

How can this be done?

Asked By: user13744439


Answers:

You can build a boolean mask of missing values, shift it up one row with Series.shift(-1), and set the matching rows to None via DataFrame.loc. Passing fill_value=True ensures that if the column ends with a non-null value, that final run is also trimmed:

m = df[1].isna()
df.loc[m.shift(-1, fill_value=True) & ~m, 1] = None
print (df)
            0     1
0   1/24/2022  None
1   1/25/2022  None
2   1/26/2022    AB
3   1/27/2022     C
4   1/28/2022     D
5   1/29/2022  None
6   1/30/2022  None
7   1/31/2022  None
8    2/1/2022  None
9    2/2/2022  None
10   2/3/2022  None
11   2/4/2022     A
12   2/5/2022     A
13   2/6/2022     A
14   2/7/2022     B
15   2/8/2022  None
16   2/9/2022  None

Details:

print (m.shift(-1, fill_value=True) & ~m)
0     False
1     False
2     False
3     False
4     False
5      True
6     False
7     False
8     False
9     False
10    False
11    False
12    False
13    False
14    False
15     True
16    False
Name: 1, dtype: bool
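
The same mask logic can be sanity-checked on a smaller toy series (a sketch, not the question's data): `m.shift(-1, fill_value=True) & ~m` is True exactly where the current value is filled but the next one is missing, i.e. at the last element of each run.

```python
import pandas as pd

# Toy series: two runs of filled values ('a', 'b') and ('c')
s = pd.Series([None, 'a', 'b', None, 'c', None])
m = s.isna()

# True where the current value is filled but the next one is missing,
# i.e. the last element of each consecutive run
mask = m.shift(-1, fill_value=True) & ~m
s[mask] = None

print(s.tolist())  # [None, 'a', None, None, None, None]
```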

Performance:

# 1.02M rows
df = pd.concat([df] * 60000, ignore_index=True)


In [113]: %%timeit
     ...: m = df[1].isnull()
     ...: 
     ...: df[1] = df.loc[~m, 1].groupby(m.cumsum()).head(-1)
     ...: 
     ...: 
74 ms ± 5.07 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [114]: %%timeit
     ...: aux = df[1].shift(-1).isnull()
     ...: df[1] = df[1].mask(aux & aux.shift().eq(False), None)
     ...: 
     ...: 
141 ms ± 1.59 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [115]: %%timeit
     ...: aux = df[1].shift(-1).isnull()
     ...: df[1] = np.where(aux & aux.shift().eq(False), None, df[1])
     ...: 
     ...: 
147 ms ± 646 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [116]: %%timeit
     ...: m = df[1].isna()
     ...: df.loc[m.shift(-1, fill_value=True) & ~m, 1] = None
     ...: 
     ...: 
35.2 ms ± 3.93 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
Answered By: jezrael

If you want to do both explicitly then:

df[1][df[1] == 'Epiphany'] = None
df[1][(df[1] == 'B') & (df[0] == '2/8/2022')] = None

Edit:

As commented by Corralien,
you can do:

df.loc[df[1] == 'Epiphany', 1] = None
df.loc[(df[1] == 'B') & (df[0] == '2/8/2022'), 1] = None

to avoid a potential SettingWithCopyWarning.
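
A toy illustration of the difference (a sketch with made-up two-row data, not the question's DataFrame): chained indexing like `df[1][cond] = None` may assign into a temporary copy and trigger the warning, while `.loc` selects and assigns in a single step.

```python
import pandas as pd

# Two-row toy frame using the question's integer column labels
df = pd.DataFrame({0: ['x', 'y'], 1: ['Epiphany', 'B']})

# .loc performs selection and assignment in one operation,
# so pandas never works on an intermediate copy
df.loc[df[1] == 'Epiphany', 1] = None

print(df[1].tolist())  # [None, 'B']
```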

Answered By: God Is One

Another possible solution:

import numpy as np

aux = df[1].shift(-1).isnull()
df[1] = np.where(aux & aux.shift().eq(False), None, df[1])

Or:

aux = df[1].shift(-1).isnull()
df[1] = df[1].mask(aux & aux.shift().eq(False), None)

Output:

            0     1
0   1/24/2022  None
1   1/25/2022  None
2   1/26/2022    AB
3   1/27/2022     C
4   1/28/2022     D
5   1/29/2022  None
6   1/30/2022  None
7   1/31/2022  None
8    2/1/2022  None
9    2/2/2022  None
10   2/3/2022  None
11   2/4/2022     A
12   2/5/2022     A
13   2/6/2022     A
14   2/7/2022     B
15   2/8/2022  None
16   2/9/2022  None
Answered By: PaulS

Use a custom groupby.head:

# identify null values
m = df[1].isnull()

# groupby consecutive non-null: groupby(m.cumsum())
# get the values except the last per group: head(-1)
# assign back to the column
df[1] = df.loc[~m, 1].groupby(m.cumsum()).head(-1)

Output:

            0    1
0   1/24/2022  NaN
1   1/25/2022  NaN
2   1/26/2022   AB
3   1/27/2022    C
4   1/28/2022    D
5   1/29/2022  NaN
6   1/30/2022  NaN
7   1/31/2022  NaN
8    2/1/2022  NaN
9    2/2/2022  NaN
10   2/3/2022  NaN
11   2/4/2022    A
12   2/5/2022    A
13   2/6/2022    A
14   2/7/2022    B
15   2/8/2022  NaN
16   2/9/2022  NaN
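
To see why `m.cumsum()` works as the group key, here is a sketch on toy data (not the question's DataFrame): each null increments the running count, so every run of consecutive non-nulls shares a single label, and `head(-1)` then keeps all but the last value of each run.

```python
import pandas as pd

# Toy series: runs ('a', 'b') and ('c', 'd', 'e') separated by nulls
s = pd.Series([None, 'a', 'b', None, 'c', 'd', 'e'])
m = s.isnull()

# Nulls increment the counter, so each non-null run gets one label
print(m.cumsum().tolist())  # [1, 1, 1, 2, 2, 2, 2]

# head(-1) drops the last value of each run
out = s[~m].groupby(m.cumsum()).head(-1)
print(out.tolist())  # ['a', 'c', 'd']
```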
Answered By: mozway