Pandas drop nan using first valid index by group

Question:

I’m working with the following DataFrame:

         Date    Id    Amount
   0    201301    1      nan
   1    201302    1      nan
   2    201303    1      100
   3    201304    1      120
   4    201305    1      nan
   5    201306    1      120
   6    201302    2      nan
   7    201303    2      150
   8    201304    2      180

I’m trying to get the first valid index of Amount by Id. For some reason, this doesn’t work:

df.groupby('Id').Amount.first_valid_index()

I’m also trying this:

df.groupby('Id').Amount.apply(lambda x: x.first_valid_index())

But my dataset has 20M+ rows, so this takes too long to be practical.

Is there any faster way to find the first index by group?

My desired output would be:

first_idx = [2,7]

Or even better:

         Date    Id    Amount
   2    201303    1      100
   3    201304    1      120
   4    201305    1      nan
   5    201306    1      120
   7    201303    2      150
   8    201304    2      180

Edit: df.groupby('Id').Amount.apply(lambda x: x.first_valid_index()) does work, but I suspect there is a faster option; the problem doesn’t seem that complex.

Asked By: Juan C


Answers:

Option 1: To get just the first indexes (this relies on Date being sorted in ascending order within each Id, as it is in the sample):

df[df.Amount.notna()].groupby('Id').Date.idxmin()
# 1.42 ms ± 14.1 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

outputs:

Id
1    2
2    7
Name: Date, dtype: int64
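
If Date is not guaranteed to be sorted within each Id, a variant that depends only on row order may be safer. The following is a minimal sketch rather than part of the original answer: it rebuilds the sample frame from the question and keeps the first valid row per Id with drop_duplicates.

```python
import numpy as np
import pandas as pd

# Sample frame copied from the question.
df = pd.DataFrame({
    "Date": [201301, 201302, 201303, 201304, 201305, 201306,
             201302, 201303, 201304],
    "Id": [1, 1, 1, 1, 1, 1, 2, 2, 2],
    "Amount": [np.nan, np.nan, 100, 120, np.nan, 120, np.nan, 150, 180],
})

# Drop rows without an Amount, then keep the first remaining row of each Id.
valid = df[df["Amount"].notna()]
first_idx = valid.drop_duplicates("Id").index.tolist()
print(first_idx)  # [2, 7]
```

An Id whose Amount values are all NaN simply does not appear in the result.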

Option 2: To keep every row from the first valid one onward, use cumsum on notna() within each group:

df[df['Amount'].notna().groupby(df['Id']).cumsum().gt(0)]
# 2.09 ms ± 220 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

Option 3: You can ffill() within each group and keep the rows that are not NaN after filling:

df[df.groupby('Id').Amount.ffill().notna()]
# 831 µs ± 14.8 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

Output:

     Date  Id  Amount
2  201303   1   100.0
3  201304   1   120.0
4  201305   1     NaN
5  201306   1   120.0
7  201303   2   150.0
8  201304   2   180.0
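
To check that Options 2 and 3 select the same rows, here is a small self-contained sketch on the sample frame from the question. The variable names (mask_cumsum, mask_ffill, trimmed) are illustrative only.

```python
import numpy as np
import pandas as pd

# Sample frame copied from the question.
df = pd.DataFrame({
    "Date": [201301, 201302, 201303, 201304, 201305, 201306,
             201302, 201303, 201304],
    "Id": [1, 1, 1, 1, 1, 1, 2, 2, 2],
    "Amount": [np.nan, np.nan, 100, 120, np.nan, 120, np.nan, 150, 180],
})

# Option 2: running count of valid values per Id; positive once one was seen.
mask_cumsum = df["Amount"].notna().groupby(df["Id"]).cumsum().gt(0)

# Option 3: forward-fill per Id; leading NaNs have nothing to fill from.
mask_ffill = df.groupby("Id")["Amount"].ffill().notna()

same = bool((mask_cumsum == mask_ffill).all())
trimmed = df[mask_ffill]
print(same)                    # True
print(trimmed.index.tolist())  # [2, 3, 4, 5, 7, 8]
```

Note that the NaN at index 4 is kept: only the leading NaNs of each group are removed.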

Conclusion: Option 3 is the fastest in these timings.


Update: to filter both ends using Option 3:

amt_group = df.groupby('Id').Amount
df[amt_group.bfill().notna() & amt_group.ffill().notna()]
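
As an illustration of the two-sided filter, here is a sketch on a small hypothetical frame (not the question's data) in which one group also ends with NaN values:

```python
import numpy as np
import pandas as pd

# Hypothetical data: Id 1 starts with NaN, Id 2 ends with NaN.
df = pd.DataFrame({
    "Id": [1, 1, 1, 1, 2, 2, 2],
    "Amount": [np.nan, 100, np.nan, 120, 150, np.nan, np.nan],
})

amt_group = df.groupby("Id")["Amount"]

# ffill leaves leading NaNs unfilled; bfill leaves trailing NaNs unfilled.
# A row is kept only if it has a valid value on both sides (or is valid itself).
keep = amt_group.ffill().notna() & amt_group.bfill().notna()
trimmed = df[keep]
print(trimmed.index.tolist())  # [1, 2, 3, 4]
```

The interior NaN at index 2 survives because it sits between two valid values of the same Id.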
Answered By: Quang Hoang

Create a mask with .notnull + .cumsum to get everything after the first non-null Amount within the group. Then make a slice.

m = df.Amount.notnull().groupby(df.Id).cumsum().ge(1)

df.loc[m]
     Date  Id  Amount
2  201303   1   100.0
3  201304   1   120.0
4  201305   1     NaN
5  201306   1   120.0
7  201303   2   150.0
8  201304   2   180.0
Answered By: ALollz

If shifting is also required (e.g. to start from the row before the first valid index), note that chaining ffill() and shift() directly would shift the whole frame rather than each group, so group again by Id before shifting:

ffilled = df.groupby('Id')['Amount'].ffill()
df[ffilled.groupby(df['Id']).shift(-1, fill_value=True).notna()]
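
For reference, a self-contained sketch of the same idea on the question's sample frame. It is a variant, not the exact code above: instead of passing fill_value, it combines the per-group shifted mask with the unshifted one, so the last row of each group is kept through the unshifted mask.

```python
import numpy as np
import pandas as pd

# Sample frame copied from the question.
df = pd.DataFrame({
    "Date": [201301, 201302, 201303, 201304, 201305, 201306,
             201302, 201303, 201304],
    "Id": [1, 1, 1, 1, 1, 1, 2, 2, 2],
    "Amount": [np.nan, np.nan, 100, 120, np.nan, 120, np.nan, 150, 180],
})

# Forward-fill within each Id; leading NaNs stay NaN.
ffilled = df.groupby("Id")["Amount"].ffill()

# Look one row ahead within the same Id (never across group boundaries).
next_filled = ffilled.groupby(df["Id"]).shift(-1)

# Keep rows from the first valid one onward, plus the row just before it.
keep = ffilled.notna() | next_filled.notna()
result = df[keep]
print(result.index.tolist())  # [1, 2, 3, 4, 5, 6, 7, 8]
```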
Answered By: misantroop