Pandas drop nan using first valid index by group
Question:
I’m working with the following DataFrame:
Date Id Amount
0 201301 1 nan
1 201302 1 nan
2 201303 1 100
3 201304 1 120
4 201305 1 nan
5 201306 1 120
6 201302 2 nan
7 201303 2 150
8 201304 2 180
I’m trying to get the first valid index of Amount
by Id
. Because of some reason this doesn’t work:
df.groupby('Id').Amount.first_valid_index()
I’m also trying this:
df.groupby('Id').Amount.apply(lambda x: x.first_valid_index())
But my dataset is 20M+ rows, so it’s taking too long and that won’t work for me.
Is there any faster way to find the first index by group?
My desired output would be:
first_idx = [2,7]
Or even better:
Date Id Amount
2 201303 1 100
3 201304 1 120
4 201305 1 nan
5 201306 1 120
7 201303 2 150
8 201304 2 180
Edit: df.groupby('Id').Amount.apply(lambda x: x.first_valid_index())
indeed works, but I have the feeling there has to be a faster option, the problem doesn’t seem to be that complex.
Answers:
Option 1: To get just the first indexes:
df[df.Amount.notna()].groupby('Id').Date.idxmin()
# 1.42 ms ± 14.1 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
outputs:
Id
1 2
2 7
Name: Date, dtype: int64
Option 2: to get the other rows, use cumsum
on notna()
df[df['Amount'].notna().groupby(df['Id']).cumsum().gt(0)]
# 2.09 ms ± 220 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Option 3: you can ffill()
within group and choose those are not filled:
df[df.groupby('Id').Amount.ffill().notna()]
# 831 µs ± 14.8 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Output:
Date Id Amount
2 201303 1 100.0
3 201304 1 120.0
4 201305 1 NaN
5 201306 1 120.0
7 201303 2 150.0
8 201304 2 180.0
Conclusion: Option 3 is the fastest!
Update: to filter both ends using Option 3:
amt_group = df.groupby('Id').Amount
df[amt_group.bfill().notna() & amt_group.ffill().notna()]
Create a mask with .notnull
+ .cumsum
to get everything after the first non-null Amount
within the group. Then make a slice.
m = df.Amount.notnull().groupby(df.Id).cumsum().ge(1)
df.loc[m]
Date Id Amount
2 201303 1 100.0
3 201304 1 120.0
4 201305 1 NaN
5 201306 1 120.0
7 201303 2 150.0
8 201304 2 180.0
In case shifting is also required (e.g. start from the index before the first valid index). Using ffill() and shift() as chained calls would shift the frame and not the groups.
ffilled = df['Amount'].groupby('Id').ffill()
df[ffilled.groupby('Id').shift(-1, fill_value=True).notna()]
I’m working with the following DataFrame:
Date Id Amount
0 201301 1 nan
1 201302 1 nan
2 201303 1 100
3 201304 1 120
4 201305 1 nan
5 201306 1 120
6 201302 2 nan
7 201303 2 150
8 201304 2 180
I’m trying to get the first valid index of Amount
by Id
. Because of some reason this doesn’t work:
df.groupby('Id').Amount.first_valid_index()
I’m also trying this:
df.groupby('Id').Amount.apply(lambda x: x.first_valid_index())
But my dataset is 20M+ rows, so it’s taking too long and that won’t work for me.
Is there any faster way to find the first index by group?
My desired output would be:
first_idx = [2,7]
Or even better:
Date Id Amount
2 201303 1 100
3 201304 1 120
4 201305 1 nan
5 201306 1 120
7 201303 2 150
8 201304 2 180
Edit: df.groupby('Id').Amount.apply(lambda x: x.first_valid_index())
indeed works, but I have the feeling there has to be a faster option, the problem doesn’t seem to be that complex.
Option 1: To get just the first indexes:
df[df.Amount.notna()].groupby('Id').Date.idxmin()
# 1.42 ms ± 14.1 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
outputs:
Id
1 2
2 7
Name: Date, dtype: int64
Option 2: to get the other rows, use cumsum
on notna()
df[df['Amount'].notna().groupby(df['Id']).cumsum().gt(0)]
# 2.09 ms ± 220 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Option 3: you can ffill()
within group and choose those are not filled:
df[df.groupby('Id').Amount.ffill().notna()]
# 831 µs ± 14.8 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Output:
Date Id Amount
2 201303 1 100.0
3 201304 1 120.0
4 201305 1 NaN
5 201306 1 120.0
7 201303 2 150.0
8 201304 2 180.0
Conclusion: Option 3 is the fastest!
Update: to filter both ends using Option 3:
amt_group = df.groupby('Id').Amount
df[amt_group.bfill().notna() & amt_group.ffill().notna()]
Create a mask with .notnull
+ .cumsum
to get everything after the first non-null Amount
within the group. Then make a slice.
m = df.Amount.notnull().groupby(df.Id).cumsum().ge(1)
df.loc[m]
Date Id Amount
2 201303 1 100.0
3 201304 1 120.0
4 201305 1 NaN
5 201306 1 120.0
7 201303 2 150.0
8 201304 2 180.0
In case shifting is also required (e.g. start from the index before the first valid index). Using ffill() and shift() as chained calls would shift the frame and not the groups.
ffilled = df['Amount'].groupby('Id').ffill()
df[ffilled.groupby('Id').shift(-1, fill_value=True).notna()]