python pandas dataframe, add column and tag an adjusted and inserted row
Question:
I have the following DataFrame:
import pandas as pd
df = pd.DataFrame()
df['number'] = (651,651,651,4267,4267,4267,4267,4267,4267,4267,8806,8806,8806,6841,6841,6841,6841)
df['name'] = ('Alex','Alex','Alex','Ankit','Ankit','Ankit','Ankit','Ankit','Ankit','Ankit','Abhishek','Abhishek','Abhishek','Blake','Blake','Blake','Blake')
df['hours'] = (8.25,7.5,7.5,7.5,14,12,15,11,6.5,14,15,15,13.5,8,8,8,8)
df['loc'] = ('Nar','SCC','RSL','UNIT-C','UNIT-C','UNIT-C','UNIT-C','UNIT-C','UNIT-C','UNIT-C','UNI','UNI','UNI','UNKING','UNKING','UNKING','UNKING')
print(df)
If the running balance of an individual's hours exceeds 38, the row that crosses the 38th hour is trimmed so that the running total reaches exactly 38, a duplicate of that row is inserted directly after it, and the remaining hours are placed in the inserted row. The following code does this; comparing its output with the original data shows the difference.
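# Running total of hours within each person (grouped by 'number')
s = df.groupby('number')['hours'].cumsum()
# True on every row where the running total has passed 38
m = s.gt(38)
# First True row per group; groups with no True row return their first row, which `& m` filters out below
idx = m.groupby(df['number']).idxmax()
# Hours still available before 38 at each row (38 minus the previous running total)
delta = s.groupby(df['number']).shift().rsub(38).fillna(s)
# Repeat each crossing row once; every other row appears a single time
out = df.loc[df.index.repeat((df.index.isin(idx) & m) + 1)]
# First copy of a repeated row keeps only the hours up to 38
out.loc[out.index.duplicated(keep='last'), 'hours'] = delta
# Second copy gets the remainder
out.loc[out.index.duplicated(), 'hours'] -= delta
print(out)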
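To make the arithmetic concrete, here is a small sketch that reproduces the split for a single person by hand. It reuses the hours of number 4267 from the data above; the variable names (running, first_over, room_left, overflow) are made up for this illustration and are not part of the code above.

```python
import pandas as pd

# Hours for number 4267, copied from the question's data
hours = pd.Series([7.5, 14, 12, 15, 11, 6.5, 14])

# Running total: 7.5, 21.5, 33.5, 48.5, ...
running = hours.cumsum()

# Position of the first row whose running total passes 38
first_over = running.gt(38).idxmax()

# Hours that still fit under 38 on that row (38 minus the previous running total)
room_left = 38 - running.shift().loc[first_over]

# Hours that spill over into the inserted row
overflow = hours.loc[first_over] - room_left

print(first_over, room_left, overflow)  # 3 4.5 10.5
```

The 15-hour row is therefore split into 4.5 and 10.5, which is what the adjusted and inserted rows should show for this person.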
I need to tag both the adjusted row and the inserted row by adding another column that holds a character such as 'x' on those rows.
Answers:
Because the adjusted row and the inserted row share the same index label, you can use out.index.duplicated(keep=False) as a boolean mask:
# or, after `import numpy as np` (fills untagged rows with '-' instead of NaN):
# out['mod'] = np.where(out.index.duplicated(keep=False), 'x', '-')
out.loc[out.index.duplicated(keep=False), 'mod'] = 'x'
print(out)
# Output
    number      name  hours     loc  mod
0      651      Alex   8.25     Nar  NaN
1      651      Alex   7.50     SCC  NaN
2      651      Alex   7.50     RSL  NaN
3     4267     Ankit   7.50  UNIT-C  NaN
4     4267     Ankit  14.00  UNIT-C  NaN
5     4267     Ankit  12.00  UNIT-C  NaN
6     4267     Ankit   4.50  UNIT-C    x  # index 6
6     4267     Ankit  10.50  UNIT-C    x  # twice
7     4267     Ankit  11.00  UNIT-C  NaN
8     4267     Ankit   6.50  UNIT-C  NaN
9     4267     Ankit  14.00  UNIT-C  NaN
10    8806  Abhishek  15.00     UNI  NaN
11    8806  Abhishek  15.00     UNI  NaN
12    8806  Abhishek   8.00     UNI    x  # index 12
12    8806  Abhishek   5.50     UNI    x  # twice
13    6841     Blake   8.00  UNKING  NaN
14    6841     Blake   8.00  UNKING  NaN
15    6841     Blake   8.00  UNKING  NaN
16    6841     Blake   8.00  UNKING  NaN
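If you would rather not have NaN in the untagged rows, here is a minimal, self-contained sketch of the np.where variant. The four-row frame is a hypothetical miniature of out (index label 6 appears twice, as it does after the row is repeated), and is_split is just an illustrative name.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of `out`: index label 6 occurs twice
out = pd.DataFrame(
    {"name": ["Ankit", "Ankit", "Ankit", "Ankit"],
     "hours": [12.0, 4.5, 10.5, 11.0]},
    index=[5, 6, 6, 7],
)

# keep=False flags every occurrence of a repeated label, not only the later ones
is_split = out.index.duplicated(keep=False)

# np.where fills both branches, so untagged rows get '-' rather than NaN
out["mod"] = np.where(is_split, "x", "-")

print(out["mod"].tolist())  # ['-', 'x', 'x', '-']
```

Note that this approach relies on the original index being unique before the rows are repeated; if df already contained duplicate index labels, those rows would be tagged as well.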