Change first element of each group in pandas DataFrame

Question:

I want the first value of val2 within each vintage group to be NaN. Two of them already are, but I also want 0.53 to change to NaN.

import numpy as np
import pandas as pd

df = pd.DataFrame({
    'vintage': ['2017-01-01', '2017-01-01', '2017-01-01', '2017-02-01', '2017-02-01', '2017-03-01'],
    'date': ['2017-01-01', '2017-02-01', '2017-03-01', '2017-02-01', '2017-03-01', '2017-03-01'],
    'val1': [0.59, 0.68, 0.8, 0.54, 0.61, 0.6],
    'val2': [np.nan, 0.66, 0.81, 0.53, 0.62, np.nan]
})

Here’s what I’ve tried so far:

df.groupby('vintage').first().val2  # This gives the first non-NaN value per group, as shown below

vintage
2017-01-01    0.66
2017-02-01    0.53
2017-03-01     NaN

df.groupby('vintage').first().val2 = np.nan  # This doesn't change anything (it assigns to a temporary copy)
df.val2

0     NaN
1    0.66
2    0.81
3    0.53
4    0.62
5     NaN
Asked By: Gaurav Bansal


Answers:

You can’t assign to the result of an aggregation, and first skips existing NaN values. What you can do instead is call head(1), which returns the first row of each group, and pass those row indices to loc to mask the original df and overwrite the column values:

In [91]:
df.loc[df.groupby('vintage')['val2'].head(1).index, 'val2'] = np.nan
df

Out[91]: 
         date  val1  val2     vintage
0  2017-01-01  0.59   NaN  2017-01-01
1  2017-02-01  0.68  0.66  2017-01-01
2  2017-03-01  0.80  0.81  2017-01-01
3  2017-02-01  0.54   NaN  2017-02-01
4  2017-03-01  0.61  0.62  2017-02-01
5  2017-03-01  0.60   NaN  2017-03-01

Here you can see that head(1) returns the first row of each group:

In [94]:
df.groupby('vintage')['val2'].head(1)
Out[94]: 
0     NaN
3    0.53
5     NaN
Name: val2, dtype: float64
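
The loc call above uses only the index of that head(1) result, i.e. the row labels of each group's first row. A quick check, re-creating the question's df:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'vintage': ['2017-01-01', '2017-01-01', '2017-01-01', '2017-02-01', '2017-02-01', '2017-03-01'],
    'date': ['2017-01-01', '2017-02-01', '2017-03-01', '2017-02-01', '2017-03-01', '2017-03-01'],
    'val1': [0.59, 0.68, 0.8, 0.54, 0.61, 0.6],
    'val2': [np.nan, 0.66, 0.81, 0.53, 0.62, np.nan]
})

# Row labels of each group's first row -- these are what loc overwrites
first_rows = df.groupby('vintage')['val2'].head(1).index
print(list(first_rows))  # [0, 3, 5]
```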

Contrast this with first, which returns the first non-NaN value unless the group contains only NaN values:

In [95]:
df.groupby('vintage')['val2'].first()

Out[95]: 
vintage
2017-01-01    0.66
2017-02-01    0.53
2017-03-01     NaN
Name: val2, dtype: float64
Answered By: EdChum

Alternatively, compute each row's position within its group with cumcount, pick the first (position 0), and set val2 to np.nan:

df.loc[df.groupby('vintage').vintage.cumcount()==0,'val2']=np.nan
df
Out[154]: 
         date  val1  val2     vintage
0  2017-01-01  0.59   NaN  2017-01-01
1  2017-02-01  0.68  0.66  2017-01-01
2  2017-03-01  0.80  0.81  2017-01-01
3  2017-02-01  0.54   NaN  2017-02-01
4  2017-03-01  0.61  0.62  2017-02-01
5  2017-03-01  0.60   NaN  2017-03-01
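
For reference, cumcount numbers the rows within each group, so comparing it to 0 flags each group's first row. A quick check against the question's df:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'vintage': ['2017-01-01', '2017-01-01', '2017-01-01', '2017-02-01', '2017-02-01', '2017-03-01'],
    'date': ['2017-01-01', '2017-02-01', '2017-03-01', '2017-02-01', '2017-03-01', '2017-03-01'],
    'val1': [0.59, 0.68, 0.8, 0.54, 0.61, 0.6],
    'val2': [np.nan, 0.66, 0.81, 0.53, 0.62, np.nan]
})

# Position of each row within its vintage group
pos = df.groupby('vintage').cumcount()
print(list(pos))  # [0, 1, 2, 0, 1, 0]
```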
Answered By: BENY

I think you could also write:

def h(x):
    x['val2'].iloc[0] = np.nan
    return x

df = df.groupby("vintage", group_keys=False).apply(h)
Answered By: knoble
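
For completeness, a variant not given in any of the answers above avoids apply entirely by masking val2 where the within-group position is 0 (Series.mask replaces values with NaN wherever the condition is True). A sketch, assuming the question's df:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'vintage': ['2017-01-01', '2017-01-01', '2017-01-01', '2017-02-01', '2017-02-01', '2017-03-01'],
    'date': ['2017-01-01', '2017-02-01', '2017-03-01', '2017-02-01', '2017-03-01', '2017-03-01'],
    'val1': [0.59, 0.68, 0.8, 0.54, 0.61, 0.6],
    'val2': [np.nan, 0.66, 0.81, 0.53, 0.62, np.nan]
})

# NaN out val2 wherever the row is the first in its vintage group
df['val2'] = df['val2'].mask(df.groupby('vintage').cumcount() == 0)
print(df['val2'].isna().tolist())  # [True, False, False, True, False, True]
```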

Timings:

df = pd.DataFrame({
    'vintage': ['2017-01-01', '2017-01-01', '2017-01-01', '2017-02-01', '2017-02-01', '2017-03-01'],
    'date': ['2017-01-01', '2017-02-01', '2017-03-01', '2017-02-01', '2017-03-01', '2017-03-01'],
    'val1': [0.59, 0.68, 0.8, 0.54, 0.61, 0.6],
    'val2': [np.nan, 0.66, 0.81, 0.53, 0.62, np.nan]
})

def BENY(df):
    df.loc[df.groupby('vintage').vintage.cumcount() == 0, 'val2'] = np.nan
    
def EdChum(df):
    df.loc[df.groupby('vintage')['val2'].head(1).index, 'val2'] = np.nan
    
def knoble(df):
    def func(x):
        x['val2'].iloc[0] = np.nan
        return x
    df.groupby("vintage", group_keys=False).apply(func)

%timeit BENY(df)
406 µs ± 4.19 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
%timeit EdChum(df)
454 µs ± 1.28 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
%timeit knoble(df)
1.07 ms ± 5.55 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
Answered By: misantroop