Assigning increasing integer numbers to distinct values that share identical values in the previous columns

Question:

I have a dataframe that goes like this

Index One Two Three Four Five Six
1     A   -   -     -    -    -
2     A   B   C     -    -    -
3     A   B   C     F    L    -
4     A   B   C     F    M    S
5     A   B   D     G    N    -
6     A   B   D     H    O    -
7     A   B   D     I    P    T
8     A   B   E     J    Q    -
9     A   B   E     K    R    U
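For reproducibility, a frame like the one above could be built as follows. This is a minimal sketch that assumes the empty cells hold the placeholder string "-" (as the description below suggests) and that "Index" is the frame's index; the variable names are only illustrative.

```python
import pandas as pd

# Assumption: missing cells are the literal string "-", not NaN.
rows = [
    "A - - - - -",
    "A B C - - -",
    "A B C F L -",
    "A B C F M S",
    "A B D G N -",
    "A B D H O -",
    "A B D I P T",
    "A B E J Q -",
    "A B E K R U",
]
columns = ["One", "Two", "Three", "Four", "Five", "Six"]

df = pd.DataFrame(
    [row.split() for row in rows],
    columns=columns,
    # 1-based index named "Index", matching the table.
    index=pd.RangeIndex(1, len(rows) + 1, name="Index"),
)
print(df)
```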

I would like an output like the following, where each distinct identifier is replaced by an increasing integer, numbered separately within each combination of values in the previous columns:

  • Column "One" has only one distinct identifier ("A"), so every row gets the integer 1.
  • Column "Two" has two distinct identifiers, "-" and "B". This works like a regular pd.Categorical because all rows share the same value in column "One".
  • Column "Three" is where things get tricky for me. Row 1 gets 1 because "-" is the only distinct identifier for the combination ("A", "-") in the two previous columns "One" and "Two". Rows 2, 3 and 4 also get 1 because "C" is the first distinct identifier for the combination ("A", "B") in columns "One" and "Two". Rows 5, 6 and 7 get 2 because "D" is the second distinct identifier for the combination ("A", "B"), and so on.
  • A last example: rows 3 and 4 in column "Five" get 1 and 2 because they share the same path (A, B, C, F) up to column "Four" but have distinct values (L and M) in column "Five".

Index One Two Three Four Five Six
1     1   1   1     1    1    1
2     1   2   1     1    1    1
3     1   2   1     2    1    1
4     1   2   1     2    2    1
5     1   2   2     1    1    1
6     1   2   2     2    1    1
7     1   2   2     3    1    1
8     1   2   3     1    1    1
9     1   2   3     2    1    1

Apologies for the small essay.
And thanks for your help.
I tried looping over multiple groupby calls, but I got lost in it.

Regards,
Dario

Asked By: Dario Bani

Answers:

IIUC, you need to perform successive groupby.ngroup calls, using the previous column as the grouper:

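import pandas as pd

# df is the question's frame, with "Index" as its index
out = pd.DataFrame(index=df.index)

# first column: number its distinct values, starting at 1
out[df.columns[0]] = df.groupby(df.columns[0]).ngroup().add(1)

# other columns: number the distinct values within each group of the previous column
for i in range(1, df.shape[1]):
    out[df.columns[i]] = (
        df.groupby(df.columns[i-1], group_keys=False)
          .apply(lambda g: g.groupby(df.columns[i]).ngroup().add(1))
          .squeeze()
    )

print(out)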
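To make a single step concrete: ngroup labels each row with the position of its group, counting from 0 in sorted key order by default, hence the .add(1). A minimal sketch on made-up values (the frame `demo` is hypothetical, not the question's data):

```python
import pandas as pd

# Hypothetical mini-frame, only to show what one ngroup step returns.
demo = pd.DataFrame({"Three": ["D", "-", "C", "D"]})

# Groups are numbered in sorted key order: "-" -> 0, "C" -> 1, "D" -> 2.
zero_based = demo.groupby("Three").ngroup()
one_based = zero_based.add(1)

print(zero_based.tolist())  # [2, 0, 1, 2]
print(one_based.tolist())   # [3, 1, 2, 3]
```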

If you need to group by all previous columns, change the loop to:

for i in range(1, df.shape[1]):
    out[df.columns[i]] = (
        df.groupby(list(df.columns[:i]), group_keys=False)
          .apply(lambda g: g.groupby(df.columns[i]).ngroup().add(1))
          .squeeze()
    )

Output:

       One  Two  Three  Four  Five  Six
Index                                  
1        1    1      1     1     1    1
2        1    2      1     1     1    1
3        1    2      1     2     1    1
4        1    2      1     2     2    1
5        1    2      2     1     1    1
6        1    2      2     2     1    1
7        1    2      2     3     1    1
8        1    2      3     1     1    1
9        1    2      3     2     1    1
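
The two loops can disagree when the same value in the previous column appears under different parents. The following is a minimal sketch of that case rather than the code above: it swaps the nested apply for groupby(...).transform with pd.factorize, and the helper name `number_levels`, its `all_previous` flag, and the toy data are all hypothetical.

```python
import pandas as pd


def number_levels(df, all_previous=True):
    """Number distinct values per column, restarting at 1 inside each parent group.

    Hypothetical helper: with all_previous=True the parent group is the
    combination of all earlier columns, otherwise only the previous column.
    """
    cols = list(df.columns)
    out = pd.DataFrame(index=df.index)
    # First column has no parent: number its sorted distinct values from 1.
    out[cols[0]] = pd.factorize(df[cols[0]], sort=True)[0] + 1
    for i in range(1, len(cols)):
        keys = cols[:i] if all_previous else [cols[i - 1]]
        # Within each parent group, map sorted distinct values to 1, 2, 3, ...
        out[cols[i]] = df.groupby(keys)[cols[i]].transform(
            lambda s: pd.Series(pd.factorize(s, sort=True)[0] + 1, index=s.index)
        )
    return out


# Toy data (an assumption, not from the question): "X" sits under both "A" and "B".
toy = pd.DataFrame(
    {
        "One": ["A", "A", "B", "B"],
        "Two": ["X", "X", "X", "X"],
        "Three": ["p", "q", "q", "r"],
    }
)

by_path = number_levels(toy, all_previous=True)
by_prev = number_levels(toy, all_previous=False)
print(by_path["Three"].tolist())  # [1, 2, 1, 2]
print(by_prev["Three"].tolist())  # [1, 2, 2, 3]
```

On the question's data, both settings of this sketch produce the table shown above.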
Answered By: mozway