Vectorized way to create a column based on indexes stored in another column
Question:
I have a column that stores the indexes of the last valid index of another column in a rolling window. This was done based on this answer.
So e.g. we had
import pandas as pd

d = {'col': [True, False, True, True, False, False]}
df = pd.DataFrame(data=d)
and then we got the last valid index in a rolling window with
df['new'] = df.index
df['new'] = df['new'].where(df.col).ffill().rolling(3).max()
0 NaN
1 NaN
2 2.0
3 3.0
4 3.0
5 3.0
How can I use those indexes to store, in a new column new_col, the values of a different column col_b of the same dataframe at the indexes recorded above?
e.g. if a different column col_b was
'col_b': [100, 200, 300, 400, 500, 600]
then the expected outcome of new_col based on the indexes above would be
0 NaN
1 NaN
2 300
3 400
4 400
5 400
PS. Let me know if it's easier to directly use the initial col for this purpose in some way (always on a rolling window).
Answers:
One idea is to set col_b as the index and then call Series.idxmax, so that the rolling operation returns col_b labels instead of positions:
df = df.set_index('col_b')
df['new'] = df.index.to_series().where(df.col).ffill().rolling(3).apply(lambda x: x.idxmax())
df = df.reset_index(drop=True)
print (df)
col new
0 True NaN
1 False NaN
2 True 300.0
3 True 400.0
4 False 400.0
5 False 400.0
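Not part of the original answer, but for reference, here is a self-contained sketch of this idxmax idea end to end, assuming the sample data from the question:

```python
import pandas as pd

df = pd.DataFrame({'col': [True, False, True, True, False, False],
                   'col_b': [100, 200, 300, 400, 500, 600]})

# With col_b as the index, idxmax returns col_b labels instead of positions
df = df.set_index('col_b')
df['new'] = (df.index.to_series()
               .where(df.col)
               .ffill()
               .rolling(3)
               .apply(lambda x: x.idxmax()))
df = df.reset_index()
print(df)
```

Note that this relies on col_b increasing along the index: the rolling max is taken over col_b values themselves, so a non-monotonic col_b would not pick the last valid row.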
It is also possible to add Series.reindex to look the values of df['new'] up in col_b; because df['new'] contains duplicated values, the original index then has to be restored afterwards (e.g. with Series.set_axis):
df['new'] = df['col_b'].reindex(df['new']).set_axis(df.index)
print (df)
col col_b new
0 True 100 NaN
1 False 200 NaN
2 True 300 300.0
3 True 400 400.0
4 False 500 400.0
5 False 600 400.0
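A runnable version of this reindex variant (not part of the original answer), again assuming the question's data and the positional df['new'] from the question:

```python
import pandas as pd

df = pd.DataFrame({'col': [True, False, True, True, False, False],
                   'col_b': [100, 200, 300, 400, 500, 600]})

# Positional last-valid-index per rolling window, as in the question
df['new'] = df.index.to_series().where(df.col).ffill().rolling(3).max()

# Look the positions up in col_b (NaN positions stay NaN),
# then restore the original RangeIndex lost to the duplicated labels
df['new'] = df['col_b'].reindex(df['new']).set_axis(df.index)
print(df)
```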
Or, if a RangeIndex is always guaranteed, it is possible to use numpy indexing after removing missing values and casting to integers:
s = df['new'].dropna().astype(int)
df['new'] = pd.Series(df['col_b'].to_numpy()[s], index=s.index)
print (df)
col col_b new
0 True 100 NaN
1 False 200 NaN
2 True 300 300.0
3 True 400 400.0
4 False 500 400.0
5 False 600 400.0
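For completeness, the numpy-indexing variant as a complete snippet (a sketch assuming the question's sample frame with its default RangeIndex):

```python
import pandas as pd

df = pd.DataFrame({'col': [True, False, True, True, False, False],
                   'col_b': [100, 200, 300, 400, 500, 600]})
df['new'] = df.index.to_series().where(df.col).ffill().rolling(3).max()

# Keep only the valid positions so they can be cast to int
s = df['new'].dropna().astype(int)
# Fancy-index col_b with the positions; index alignment restores the NaN rows
df['new'] = pd.Series(df['col_b'].to_numpy()[s], index=s.index)
print(df)
```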
Btw, your original solution can be simplified:
df['new'] = df.index.to_series().where(df.col).ffill().rolling(3).max()
Does this work? It uses df['new'] as the indices to access values from df['col_b']. This requires converting df['new'] to int, so there are some intermediate steps: the NaNs are replaced with 0s first, then put back into the new column at the end.
import numpy as np

# Replace NaNs with a dummy index of 0 so the column can be cast to int
new_as_idx = df['new'].copy()
new_as_idx[np.isnan(new_as_idx)] = 0
new_as_idx = new_as_idx.astype(int)

# Gather from col_b, then restore NaN wherever the original index was NaN
new_b = df['col_b'].to_numpy()[new_as_idx]
new_b = new_b.astype('float')
new_b[np.isnan(df['new'])] = np.nan
df['new_b'] = new_b
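As a side note (not from either answer), the NaN bookkeeping can be sketched more compactly with Series.map, which looks df['new'] values up as index labels in df['col_b'] and propagates NaN automatically; this assumes the dataframe keeps its default RangeIndex:

```python
import pandas as pd

df = pd.DataFrame({'col': [True, False, True, True, False, False],
                   'col_b': [100, 200, 300, 400, 500, 600]})
df['new'] = df.index.to_series().where(df.col).ffill().rolling(3).max()

# Map stored positions to col_b values; NaN positions map to NaN
df['new_b'] = df['new'].map(df['col_b'])
```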