Pandas: Create new column with repeating values based on non-repeating values in another column
Question:
I have a dataframe with columns in the following format:
import pandas as pd
df = pd.DataFrame(data={
    'value': [123, 456, 789, 111, 121, 34523, 4352, 45343, 623],
    'repeatVal': ['NaN', 2, 'NaN', 'NaN', 3, 'NaN', 'NaN', 'NaN', 'NaN'],
})
I want to create a new column that takes each value from ‘value’ where ‘repeatVal’ is set and repeats it downward that many times, so the output looks like ‘result’:
df = pd.DataFrame(data={
    'value': [123, 456, 789, 111, 121, 34523, 4352, 45343, 623],
    'repeatVal': ['NaN', 2, 'NaN', 'NaN', 3, 'NaN', 'NaN', 'NaN', 'NaN'],
    'result': ['NaN', 456, 456, 'NaN', 121, 121, 121, 'NaN', 'NaN']
})
To be clear, I do not want to duplicate the rows. I only want to create a new column where values are repeated n times, where n is specified in a different column. The column ‘repeatVal’ is structured so that repeats never overlap: there will always be enough NaN values between the repeat indicators.
I have read the docs on np.repeat and np.tile but those don’t appear to solve this issue.
Answers:
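One option is to use groupby.cumcount to build masks:
# turn the 'NaN' strings into real missing values
df = df.replace('NaN', float('nan'))
# m1: rows that start a repeat
m1 = df['repeatVal'].notna()
# m2: rows whose position within their group is below the repeat count
m2 = df.groupby(m1.cumsum()).cumcount().lt(df['repeatVal'].ffill())
# keep the starting values, fill them down, then blank the rows outside m2
df['result'] = df['value'].where(m1).ffill().where(m2)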
Output:
   value  repeatVal  result
0    123        NaN     NaN
1    456        2.0   456.0
2    789        NaN   456.0
3    111        NaN     NaN
4    121        3.0   121.0
5  34523        NaN   121.0
6   4352        NaN   121.0
7  45343        NaN     NaN
8    623        NaN     NaN
Intermediates:
   value  repeatVal  result     m1  m1.cumsum()  cumcount  cumcount < repeatVal.ffill()  value/masked/ffill
0    123        NaN     NaN  False            0         0                         False                 NaN
1    456        2.0   456.0   True            1         0                          True               456.0
2    789        NaN   456.0  False            1         1                          True               456.0
3    111        NaN     NaN  False            1         2                         False               456.0
4    121        3.0   121.0   True            2         0                          True               121.0
5  34523        NaN   121.0  False            2         1                          True               121.0
6   4352        NaN   121.0  False            2         2                          True               121.0
7  45343        NaN     NaN  False            2         3                         False               121.0
8    623        NaN     NaN  False            2         4                         False               121.0
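To see where each column of the intermediates table comes from, the chained expression can be split into named steps. This is only a sketch: the names group, pos, filled and steps are made up for illustration, and the frame is built with real np.nan values rather than 'NaN' strings so no replace step is needed.

```python
import numpy as np
import pandas as pd

# Same data as the question, but with real NaN values (an assumption of this sketch).
df = pd.DataFrame({
    'value': [123, 456, 789, 111, 121, 34523, 4352, 45343, 623],
    'repeatVal': [np.nan, 2, np.nan, np.nan, 3, np.nan, np.nan, np.nan, np.nan],
})

m1 = df['repeatVal'].notna()            # True on rows that start a repeat
group = m1.cumsum()                     # group id: increments at every start row
pos = df.groupby(group).cumcount()      # 0-based position of each row in its group
m2 = pos.lt(df['repeatVal'].ffill())    # True while position < repeat count
filled = df['value'].where(m1).ffill()  # start values carried down the whole group

# Collect the steps side by side for inspection (hypothetical helper frame).
steps = pd.DataFrame({'m1': m1, 'group': group, 'pos': pos, 'm2': m2, 'filled': filled})

result = filled.where(m2)               # blank out rows past the repeat count
print(steps.assign(result=result))
```

Rows before the first repeat indicator land in group 0, where repeatVal.ffill() is still NaN, so the comparison is False and they stay empty.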
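Here is a way using index.repeat:
# assumes repeatVal already holds real NaN values (see the replace step above)
# repeat each row label repeatVal times, then shift the labels down by 0, 1, 2, ...
((v := df.loc[df.index.repeat(df['repeatVal'].fillna(0)),'value'])
 .set_axis(v.groupby(v).cumcount() + v.index))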
Output:
1    456
2    456
4    121
5    121
6    121
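The expression above returns a Series that only covers the repeated rows. One way to turn it into the requested column is to assign it back and let pandas align on the index, as in the sketch below. It assumes a default integer RangeIndex (the offsets are added to the labels) and real np.nan values in repeatVal. The names counts and offset are made up here, and grouping by index level instead of by value is a swapped-in variation so that equal values in different repeat blocks do not share a counter.

```python
import numpy as np
import pandas as pd

# Same data as the question, with real NaN values (an assumption of this sketch).
df = pd.DataFrame({
    'value': [123, 456, 789, 111, 121, 34523, 4352, 45343, 623],
    'repeatVal': [np.nan, 2, np.nan, np.nan, 3, np.nan, np.nan, np.nan, np.nan],
})

# Integer repeat counts: 0 for rows without an indicator.
counts = df['repeatVal'].fillna(0).astype(int)

# Select each start row `counts` times; labels come out as [1, 1, 4, 4, 4].
v = df.loc[df.index.repeat(counts.to_numpy()), 'value']

# Offset 0, 1, 2, ... inside each block of identical labels.
offset = v.groupby(level=0).cumcount()

# Shift labels down so each copy lands on its own row: [1, 2, 4, 5, 6].
v = v.set_axis(v.index.to_numpy() + offset.to_numpy())

# Assignment aligns on the index; rows without a match become NaN.
df['result'] = v
print(df)
```

Because the question guarantees that repeats never overlap, the shifted labels stay unique and the assignment aligns cleanly.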