Pandas setting column values to adjacent ones for duplicates found in a key column

Question:

Assuming we have the following table:

+---------+------+------+------+------+------+------+------+------+
| COL_ID  | COLB | COLC | COLD | COLE | COLF | COLG | COLH | COLI |
+---------+------+------+------+------+------+------+------+------+
| aa1     |    1 |    1 |      |      |      |      |      |      |
| aa1     |    2 |    1 |      |      |      |      |      |      |
| aa2     |    3 |    1 |      |      |      |      |      |      |
| ab3     |    6 |    2 |      |      |      |      |      |      |
| ab3     |    5 |    2 |      |      |      |      |      |      |
| ab3     |    7 |    1 |      |      |      |      |      |      |
| ab3     |    1 |    1 |      |      |      |      |      |      |
+---------+------+------+------+------+------+------+------+------+

When duplicates of COL_ID are found, how can we collapse them into a single row, appending each duplicate row's values into the adjacent empty columns?

+---------+------+------+------+------+------+------+------+------+
| COL_ID  | COLB | COLC | COLD | COLE | COLF | COLG | COLH | COLI |
+---------+------+------+------+------+------+------+------+------+
| aa1     |    1 |    1 |    2 |    1 |      |      |      |      |
| aa2     |    3 |    1 |      |      |      |      |      |      |
| ab3     |    6 |    2 |    5 |    2 |    7 |    1 |    1 |    1 |
+---------+------+------+------+------+------+------+------+------+

Here is the sample code to generate this table:

import pandas as pd
import numpy as np


my_dic = {'COL_ID': ['aa1', 'aa1', 'aa2', 'ab3', 'ab3', 'ab3', 'ab3'],
          'COLB': [1, 2, 3, 6, 5, 7, 1],
          'COLC': [1, 1, 1, 2, 2, 1, 1],
          'COLD': [np.nan] * 7,
          'COLE': [np.nan] * 7,
          'COLF': [np.nan] * 7,
          'COLG': [np.nan] * 7,
          'COLH': [np.nan] * 7,
          'COLI': [np.nan] * 7}

dp = pd.DataFrame(my_dic)
Asked By: Led


Answers:

The first idea is to apply a lambda function per group to all columns except COL_ID (moved to the index with DataFrame.set_index): DataFrame.stack flattens each group's filled values in row order, Series.reset_index creates a default index that becomes the new column names, and Series.unstack reshapes back to wide form; finally, astype('Int64') converts to nullable integers so missing values are preserved:

dp = (dp.set_index('COL_ID')
        .groupby('COL_ID')
        .apply(lambda x: x.stack().reset_index(drop=True))
        .unstack()
        .astype('Int64')
        .add_prefix('Col')
        .reset_index())
print (dp)
  COL_ID  Col0  Col1  Col2  Col3  Col4  Col5  Col6  Col7
0    aa1     1     1     2     1  <NA>  <NA>  <NA>  <NA>
1    aa2     3     1  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>
2    ab3     6     2     5     2     7     1     1     1
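To see why the empty columns disappear, here is a minimal sketch of the key step on a single two-row group (the frame is a reduced, hypothetical version of the question's data; .dropna() is chained explicitly, since newer pandas versions of stack no longer drop NaN by default):

```python
import pandas as pd
import numpy as np

# a minimal two-row group, mirroring the shape of the question's data
dp = pd.DataFrame({'COL_ID': ['aa1', 'aa1'],
                   'COLB': [1, 2],
                   'COLC': [1, 1],
                   'COLD': [np.nan, np.nan]})

# stack() turns each row into (column, value) pairs; dropping NaNs removes
# the empty COLD cells, leaving only the filled values in row order
s = dp.set_index('COL_ID').stack().dropna()
print(s.to_list())  # [1.0, 1.0, 2.0, 1.0]
```

These flattened values are exactly what reset_index(drop=True) then renumbers into the new column positions.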

Another idea is to stack once, number the values within each group with GroupBy.cumcount, and use that counter as the second index level for unstacking:

s = dp.set_index('COL_ID').stack().droplevel(1)
counter = s.groupby('COL_ID').cumcount().to_numpy()
s.index = [s.index, counter]

dp = s.unstack().astype('Int64').add_prefix('Col').reset_index()
print (dp)
  COL_ID  Col0  Col1  Col2  Col3  Col4  Col5  Col6  Col7
0    aa1     1     1     2     1  <NA>  <NA>  <NA>  <NA>
1    aa2     3     1  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>
2    ab3     6     2     5     2     7     1     1     1
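As a further variant (not from the original answer, a sketch assuming only COLB and COLC carry data), one can ravel each group's values into a flat array and let DataFrame.from_dict pad the shorter rows with NaN:

```python
import pandas as pd

my_dic = {'COL_ID': ['aa1', 'aa1', 'aa2', 'ab3', 'ab3', 'ab3', 'ab3'],
          'COLB': [1, 2, 3, 6, 5, 7, 1],
          'COLC': [1, 1, 1, 2, 2, 1, 1]}
dp = pd.DataFrame(my_dic)

# ravel each group's (COLB, COLC) pairs into one flat row per COL_ID;
# from_dict(orient='index') pads shorter rows with NaN automatically
wide = {k: g[['COLB', 'COLC']].to_numpy().ravel()
        for k, g in dp.groupby('COL_ID')}
out = (pd.DataFrame.from_dict(wide, orient='index')
         .astype('Int64')
         .add_prefix('Col')
         .rename_axis('COL_ID')
         .reset_index())
print(out)
```

This avoids the stack/unstack round trip entirely, at the cost of building an intermediate dict in plain Python.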
Answered By: jezrael