Change first element of each group in pandas DataFrame
Question:
I want to ensure that the first value of val2 corresponding to each vintage is NaN. Currently two are already NaN, but I want 0.53 to change to NaN as well.
import numpy as np
import pandas as pd

df = pd.DataFrame({
'vintage': ['2017-01-01', '2017-01-01', '2017-01-01', '2017-02-01', '2017-02-01', '2017-03-01'],
'date': ['2017-01-01', '2017-02-01', '2017-03-01', '2017-02-01', '2017-03-01', '2017-03-01'],
'val1': [0.59, 0.68, 0.8, 0.54, 0.61, 0.6],
'val2': [np.nan, 0.66, 0.81, 0.53, 0.62, np.nan]
})
Here’s what I’ve tried so far:
df.groupby('vintage').first().val2 #This gives the first non-NaN values, as shown below
vintage
2017-01-01 0.66
2017-02-01 0.53
2017-03-01 NaN
df.groupby('vintage').first().val2 = np.nan #This doesn't change anything
df.val2
0 NaN
1 0.66
2 0.81
3 0.53
4 0.62
5 NaN
Answers:
You can't assign to the result of an aggregation, and first skips existing NaN values. What you can do instead is call head(1), which returns the first row of each group, and pass the resulting indices to loc to overwrite those values in the original df:
In[91]
df.loc[df.groupby('vintage')['val2'].head(1).index, 'val2'] = np.nan
df:
Out[91]:
date val1 val2 vintage
0 2017-01-01 0.59 NaN 2017-01-01
1 2017-02-01 0.68 0.66 2017-01-01
2 2017-03-01 0.80 0.81 2017-01-01
3 2017-02-01 0.54 NaN 2017-02-01
4 2017-03-01 0.61 0.62 2017-02-01
5 2017-03-01 0.60 NaN 2017-03-01
Here you can see that head(1) returns the first row of each group:
In[94]:
df.groupby('vintage')['val2'].head(1)
Out[94]:
0 NaN
3 0.53
5 NaN
Name: val2, dtype: float64
Contrast this with first, which returns the first non-NaN value, unless the group contains only NaN values:
In[95]:
df.groupby('vintage')['val2'].first()
Out[95]:
vintage
2017-01-01 0.66
2017-02-01 0.53
2017-03-01 NaN
Name: val2, dtype: float64
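As a side note, a groupby-free sketch of the same idea: ~duplicated() is True only for the first occurrence of each vintage (regardless of sort order), so mask can blank exactly those positions. This is an alternative I'm suggesting, not code from the answers above:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'vintage': ['2017-01-01', '2017-01-01', '2017-01-01',
                '2017-02-01', '2017-02-01', '2017-03-01'],
    'val2': [np.nan, 0.66, 0.81, 0.53, 0.62, np.nan],
})

# ~duplicated() marks the first row of each vintage value;
# mask() replaces exactly those positions with NaN.
df['val2'] = df['val2'].mask(~df['vintage'].duplicated())
```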
Alternatively, number the rows within each group with cumcount, pick the first one (position 0), and set val2 to np.nan:
df.loc[df.groupby('vintage').vintage.cumcount()==0,'val2']=np.nan
df
Out[154]:
date val1 val2 vintage
0 2017-01-01 0.59 NaN 2017-01-01
1 2017-02-01 0.68 0.66 2017-01-01
2 2017-03-01 0.80 0.81 2017-01-01
3 2017-02-01 0.54 NaN 2017-02-01
4 2017-03-01 0.61 0.62 2017-02-01
5 2017-03-01 0.60 NaN 2017-03-01
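For reference, a minimal sketch of what cumcount produces here; it numbers rows within each group starting at 0, so the == 0 mask selects exactly the first row of each vintage:

```python
import pandas as pd

df = pd.DataFrame({
    'vintage': ['2017-01-01', '2017-01-01', '2017-01-01',
                '2017-02-01', '2017-02-01', '2017-03-01'],
})

# Position of each row within its vintage group: 0, 1, 2, 0, 1, 0
pos = df.groupby('vintage').cumcount()
```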
I think you could also write:
def h(x):
    x['val2'].iloc[0] = np.nan
    return x

df = df.groupby('vintage', group_keys=False).apply(h)
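One caveat: x['val2'].iloc[0] = ... is chained indexing, which can trigger SettingWithCopyWarning and may silently fail to write. A sketch (my variant, not the original answer) that writes through a single .iloc call instead:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'vintage': ['2017-01-01', '2017-01-01', '2017-01-01',
                '2017-02-01', '2017-02-01', '2017-03-01'],
    'val2': [np.nan, 0.66, 0.81, 0.53, 0.62, np.nan],
})

def h(x):
    # Single positional write avoids chained indexing on the group copy.
    x.iloc[0, x.columns.get_loc('val2')] = np.nan
    return x

# group_keys=False keeps the original flat index on the result.
df = df.groupby('vintage', group_keys=False).apply(h)
```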
Timings:
df = pd.DataFrame({
'vintage': ['2017-01-01', '2017-01-01', '2017-01-01', '2017-02-01', '2017-02-01', '2017-03-01'],
'date': ['2017-01-01', '2017-02-01', '2017-03-01', '2017-02-01', '2017-03-01', '2017-03-01'],
'val1': [0.59, 0.68, 0.8, 0.54, 0.61, 0.6],
'val2': [np.nan, 0.66, 0.81, 0.53, 0.62, np.nan]
})
def BENY(df):
    df.loc[df.groupby('vintage').vintage.cumcount() == 0, 'val2'] = np.nan

def EdChum(df):
    df.loc[df.groupby('vintage')['val2'].head(1).index, 'val2'] = np.nan

def knoble(df):
    def func(x):
        x['val2'].iloc[0] = np.nan
        return x
    df.groupby("vintage", group_keys=False).apply(func)
%timeit BENY(df)
406 µs ± 4.19 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
%timeit EdChum(df)
454 µs ± 1.28 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
%timeit knoble(df)
1.07 ms ± 5.55 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)