Assigning increasing integer numbers to distinct values that share identical values in the previous columns
Question:
I have a dataframe that goes like this
Index | One | Two | Three | Four | Five | Six |
---|---|---|---|---|---|---|
1 | A | – | – | – | – | – |
2 | A | B | C | – | – | – |
3 | A | B | C | F | L | – |
4 | A | B | C | F | M | S |
5 | A | B | D | G | N | – |
6 | A | B | D | H | O | – |
7 | A | B | D | I | P | T |
8 | A | B | E | J | Q | – |
9 | A | B | E | K | R | U |
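For reference, the frame above can be rebuilt like so (a sketch; I'm assuming the dashes are plain `-` placeholder strings and the index is the "Index" column):

```python
import pandas as pd

# Rebuild the questioner's frame; "-" marks an empty cell.
df = pd.DataFrame(
    {
        "One":   ["A"] * 9,
        "Two":   ["-", "B", "B", "B", "B", "B", "B", "B", "B"],
        "Three": ["-", "C", "C", "C", "D", "D", "D", "E", "E"],
        "Four":  ["-", "-", "F", "F", "G", "H", "I", "J", "K"],
        "Five":  ["-", "-", "L", "M", "N", "O", "P", "Q", "R"],
        "Six":   ["-", "-", "-", "S", "-", "-", "T", "-", "U"],
    },
    index=pd.Index(range(1, 10), name="Index"),
)
```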
And I would like an output like the following, where the distinct identifiers (which build on the combinations of the previous columns) are each assigned an increasing integer:
- Column "One" has only 1 distinct identifier ("A"), so all of them are substituted by the integer 1.
- Column "Two" has 2 distinct identifiers, "–" and "B" (this works like a regular pd.Categorical because they all share the same value in column "One").
- Column "Three" is where things get tricky for me. Row index 1 gets 1 because "–" is the only distinct identifier for the combination ("A", "–") of the two previous columns "One" and "Two". Row indices 2, 3 and 4 get 1 as well because "C" is the first distinct identifier for the combination ("A", "B") coming respectively from columns "One" and "Two". Rows 5, 6, 7 get 2, because "D" is the second distinct identifier for the combination ("A", "B"), etc…
- Last example: rows 3 and 4, column "Five". They get values 1 and 2 because they share the same path (A, B, C, F) up to column "Four", but in column "Five" they have distinct values (L and M).
Index | One | Two | Three | Four | Five | Six |
---|---|---|---|---|---|---|
1 | 1 | 1 | 1 | 1 | 1 | 1 |
2 | 1 | 2 | 1 | 1 | 1 | 1 |
3 | 1 | 2 | 1 | 2 | 1 | 1 |
4 | 1 | 2 | 1 | 2 | 2 | 1 |
5 | 1 | 2 | 2 | 1 | 1 | 1 |
6 | 1 | 2 | 2 | 2 | 1 | 1 |
7 | 1 | 2 | 2 | 3 | 1 | 1 |
8 | 1 | 2 | 3 | 1 | 1 | 1 |
9 | 1 | 2 | 3 | 2 | 1 | 1 |
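The numbering rule described in the bullets can be sketched in plain Python: keep one counter dictionary per (column position, prefix of previous values) pair, and hand out the next free integer the first time a value shows up under that prefix. Note this numbers values by first appearance, which happens to coincide with the expected table here:

```python
rows = [
    ("A", "-", "-", "-", "-", "-"),
    ("A", "B", "C", "-", "-", "-"),
    ("A", "B", "C", "F", "L", "-"),
    ("A", "B", "C", "F", "M", "S"),
    ("A", "B", "D", "G", "N", "-"),
    ("A", "B", "D", "H", "O", "-"),
    ("A", "B", "D", "I", "P", "T"),
    ("A", "B", "E", "J", "Q", "-"),
    ("A", "B", "E", "K", "R", "U"),
]

seen = {}  # (column position, prefix tuple) -> {value: assigned integer}
out = []
for row in rows:
    numbered = []
    for j, val in enumerate(row):
        counters = seen.setdefault((j, row[:j]), {})
        if val not in counters:
            counters[val] = len(counters) + 1  # next free integer for this prefix
        numbered.append(counters[val])
    out.append(tuple(numbered))
```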
Apologies for the small essay.
And thanks for your help.
I tried to loop over multiple groupbys but I got lost in it.
Regards,
Dario
Answers:
IIUC you need to perform successive groupby.ngroup, using the previous column as grouper:

    import pandas as pd

    out = pd.DataFrame(index=df.index)
    out[df.columns[0]] = df.groupby(df.columns[0]).ngroup().add(1)
    for i in range(1, df.shape[1]):
        out[df.columns[i]] = (df
            .groupby(df.columns[i-1], group_keys=False)
            .apply(lambda g: g.groupby(df.columns[i]).ngroup().add(1))
            .squeeze()
        )
    print(out)
If you need to group by all previous columns, change the loop to:

    for i in range(1, df.shape[1]):
        out[df.columns[i]] = (df
            .groupby(list(df.columns[:i]), group_keys=False)
            .apply(lambda g: g.groupby(df.columns[i]).ngroup().add(1))
            .squeeze()
        )
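The two loops agree on the sample data, but they can differ: grouping by only the previous column merges rows whose earlier paths diverge. A small hypothetical frame (names made up) where that matters:

```python
import pandas as pd

df2 = pd.DataFrame({
    "One":   ["A", "A", "B", "B"],
    "Two":   ["X", "Y", "X", "Y"],
    "Three": ["P", "P", "Q", "Q"],
})

# Variant 1: group column "Three" by the previous column only.
# Rows 0 and 2 land in the same "Two" == "X" group even though
# their "One" values differ, so "Q" gets number 2.
prev_only = (df2
    .groupby("Two", group_keys=False)
    .apply(lambda g: g.groupby("Three").ngroup().add(1))
    .squeeze()
    .sort_index()
)

# Variant 2: group by all previous columns.
# Every (One, Two) path is distinct, so each "Three" value is the
# first (and only) one seen under its prefix and gets number 1.
all_prev = (df2
    .groupby(["One", "Two"], group_keys=False)
    .apply(lambda g: g.groupby("Three").ngroup().add(1))
    .squeeze()
    .sort_index()
)
```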
Output:

           One  Two  Three  Four  Five  Six
    Index
    1        1    1      1     1     1    1
    2        1    2      1     1     1    1
    3        1    2      1     2     1    1
    4        1    2      1     2     2    1
    5        1    2      2     1     1    1
    6        1    2      2     2     1    1
    7        1    2      2     3     1    1
    8        1    2      3     1     1    1
    9        1    2      3     2     1    1
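An apply-free alternative is also possible (a sketch of my own, not part of the answer above): map each column to sorted category codes, which order values the same way ngroup does, then take a dense rank of those codes within each prefix group. On this data it reproduces the same output:

```python
import pandas as pd

def number_by_prefix(df):
    """Integer labels per column, restarting within each prefix path."""
    out = pd.DataFrame(index=df.index)
    for i, col in enumerate(df.columns):
        # Category codes number distinct values in sorted order, like ngroup.
        codes = df[col].astype("category").cat.codes
        if i == 0:
            out[col] = codes.rank(method="dense").astype(int)
        else:
            # Dense-rank the codes within each group of all previous columns.
            grouper = [df[c] for c in df.columns[:i]]
            out[col] = codes.groupby(grouper).rank(method="dense").astype(int)
    return out
```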