merge rows into new column value
Question:
I have a df where every ref_num appears in exactly two rows (duplicate pairs). For each pair, I want to take the Amt from the second row and put it on the first row in a new column called 'new_Amt', inserting NaN in 'new_Amt' for the second row. Afterwards I'll drop all rows that contain NaN.
So the dataframe looks like this:
|       | ref_num | Amt |
|-------|---------|-----|
| row 1 | 1       | 10  |
| row 2 | 1       | 20  |
| row 3 | 2       | 5   |
| row 4 | 2       | 15  |
| row 5 | 3       | 12  |
| row 6 | 3       | 7   |
after it should look like this:
|       | ref_num | Amt | new_Amt |
|-------|---------|-----|---------|
| row 1 | 1       | 10  | 20      |
| row 2 | 1       | 20  | NaN     |
| row 3 | 2       | 5   | 15      |
| row 4 | 2       | 15  | NaN     |
| row 5 | 3       | 12  | 7       |
| row 6 | 3       | 7   | NaN     |
I thought a lambda function could work, where the else branch would return NaN for the second row of each duplicate pair, but I couldn't figure out the syntax.
df['new_Amt'] = df.apply(lambda x : x['Amt'] if x['ref_num'] == x['ref_num'] else x['new_Amt'] is NaN)
Answers:
Why not do both operations at once (resolve duplicates as you describe and drop the redundant rows)?
k = 'ref_num'
newdf = df.drop_duplicates(subset=k, keep='first').merge(
df.drop_duplicates(subset=k, keep='last'), on='ref_num', suffixes=('', '_new'))
>>> newdf
ref_num Amt Amt_new
0 1 10 20
1 2 5 15
2 3 12 7
Another possibility:
gb = df.groupby('ref_num')['Amt']
newdf = pd.concat([gb.first(), gb.last()], axis=1, keys=['Amt', 'new_Amt']).reset_index()
>>> newdf
ref_num Amt new_Amt
0 1 10 20
1 2 5 15
2 3 12 7
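If you do want the intermediate two-step flow exactly as described in the question (a `new_Amt` column with NaN on each second row, then dropping those rows), one sketch uses `groupby` + `shift`:

```python
import pandas as pd

df = pd.DataFrame({'ref_num': [1, 1, 2, 2, 3, 3],
                   'Amt': [10, 20, 5, 15, 12, 7]})

# Within each ref_num group, pull the next row's Amt up one row;
# the second row of each pair has no next row and becomes NaN
df['new_Amt'] = df.groupby('ref_num')['Amt'].shift(-1)

# Drop the rows that received NaN, as described in the question
result = df.dropna(subset=['new_Amt']).reset_index(drop=True)
```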
Note: in your question it is not clear whether 'row 1', 'row 2', etc. are indices, meant to be kept or not. If they are desired in the final output, please let us know if and how they should appear.