Label encoding pandas dataframe, same label for same value

Question:

Here is a snippet of my df:

        0    1    2    3    4    5   ...   11    12    13    14    15    16
0      BSO  PRV  BSI  TUR  WSP  ACP  ...  HLR   HEX   HEX  None  None  None
1      BSO  PRV  BSI  TUR  WSP  ACP  ...  HLF   HLR   HEX   HEX   HEX  None
2      BSO  PRV  BSI  HLF  HLR  TUR  ...  HEX   RSO   RSI   HEX   HEX   HEX
3      BSO  PRV  BSI  HLF  HLR  TUR  ...  RSO   RSI   HEX   HEX   HEX  None
4      BSO  PRV  BSI  HLF  TUR  WSP  ...  RSO   RSI   HLR   HEX   HEX   HEX
    ...  ...  ...  ...  ...  ...  ...  ...   ...   ...   ...   ...   ...
32607  BSO  PRV  BSI  TUR  WSP  ACP  ...  HEX  None  None  None  None  None
32608  BSO  PRV  BSI  TUR  WSP  ACP  ...  HEX  None  None  None  None  None
32609  BSO  PRV  BSI  TUR  WSP  ACP  ...  HEX  None  None  None  None  None
32610  BSO  PRV  BSI  TUR  WSP  ACP  ...  HEX  None  None  None  None  None
32611  BSO  PRV  BSI  TUR  WSP  ACP  ...  HEX  None  None  None  None  None

Each cell is a string, and I want to label encode the dataframe so that the same string gets the same value in every row, for example all 'BSO' = 1, all 'PRV' = 2, etc. The values do not matter as long as they are consistent. I would like to exclude the None values if possible, but if not, that's OK.

I tried df.apply(le.fit_transform) and the result was:

       0   1   2   3   4   5   6   7   8   9   10  11  12  13  14  15  16
0       0   0   0   2   2   0   1   1   3   2   1   2   0   0   1   1   1
1       0   0   0   2   2   0   1   1   1   3   3   1   2   0   0   0   1
2       0   0   0   0   0   1   2   4   0   0   0   0   4   3   0   0   0
3       0   0   0   0   0   1   3   0   1   0   0   4   3   0   0   0   1
4       0   0   0   0   1   2   2   0   1   0   0   4   3   2   0   0   0
    ..  ..  ..  ..  ..  ..  ..  ..  ..  ..  ..  ..  ..  ..  ..  ..  ..
32607   0   0   0   2   2   0   1   2   2   1   2   0   5   4   1   1   1
32608   0   0   0   2   2   0   1   2   2   1   2   0   5   4   1   1   1
32609   0   0   0   2   2   0   1   2   2   1   2   0   5   4   1   1   1
32610   0   0   0   2   2   0   1   2   2   1   2   0   5   4   1   1   1
32611   0   0   0   2   2   0   1   2   2   1   2   0   5   4   1   1   1

As you can see by comparing the two tables, the integers do not map consistently to the strings across the rows.

Asked By: Tony Sirico

||

Answers:

It looks like the problem is that you have applied the transform on each column (default behaviour). Try:

df.apply(le.fit_transform, axis=1)

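The axis=1 argument applies fit_transform to each row instead of each column. Note that the encoder is still refit for every row, so a given string is not guaranteed to get the same integer in different rows.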
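
To illustrate the difference between encoding row by row and encoding with one shared numbering, here is a minimal sketch. It uses pd.factorize as a stand-in for le.fit_transform (an assumption made here to avoid depending on how LabelEncoder treats None), and a tiny made-up frame rather than the asker's data; the names per_row and shared are hypothetical.

```python
import pandas as pd

# Tiny made-up frame (not the asker's data), with one missing cell.
df = pd.DataFrame(
    [["BSO", "TUR", "HEX", None],
     ["BSO", "HLF", "TUR", "HEX"]]
)

# Per-row encoding: the numbering restarts in every row,
# so 'TUR' ends up as 1 in the first row and 2 in the second.
per_row = df.apply(
    lambda row: pd.Series(pd.factorize(row)[0], index=row.index),
    axis=1,
)

# Shared encoding: factorize all cells at once, then restore the shape.
# Missing cells (None/NaN) get the sentinel code -1.
codes, uniques = pd.factorize(df.to_numpy().ravel())
shared = pd.DataFrame(
    codes.reshape(df.shape), index=df.index, columns=df.columns
)

print(per_row)
print(shared)
print(list(uniques))  # labels in order of first appearance
```

In the shared version every string keeps one code wherever it appears, which is what the question asks for.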

Hope it helps.

Answered By: LarryBird

You can create your own encoding function:

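# Number every distinct label found in the frame, starting at 1
num2label = dict(enumerate(df.stack().unique(), 1))
# Invert the mapping: label -> number
label2num = {v: k for k, v in num2label.items()}

# Swap labels for their numbers; cells left as None/NaN become 0
out = df.replace(label2num).fillna(0).astype(int)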

Output:

>>> out
       0  1  2  3  4  5  11  12  13  14  15  16
0      1  2  3  4  5  6   7   8   8   0   0   0
1      1  2  3  4  5  6   9   7   8   8   8   0
2      1  2  3  9  7  4   8  10  11   8   8   8
3      1  2  3  9  7  4  10  11   8   8   8   0
4      1  2  3  9  4  5  10  11   7   8   8   8
32607  1  2  3  4  5  6   8   0   0   0   0   0
32608  1  2  3  4  5  6   8   0   0   0   0   0
32609  1  2  3  4  5  6   8   0   0   0   0   0
32610  1  2  3  4  5  6   8   0   0   0   0   0
32611  1  2  3  4  5  6   8   0   0   0   0   0

>>> label2num
{'BSO': 1,
 'PRV': 2,
 'BSI': 3,
 'TUR': 4,
 'WSP': 5,
 'ACP': 6,
 'HLR': 7,
 'HEX': 8,
 'HLF': 9,
 'RSO': 10,
 'RSI': 11}
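
Since num2label is built but not used above, here is a small self-contained sketch of the round trip, encoding and then decoding again. It runs on a made-up frame, and it swaps in Series.map and an explicit dropna() in place of replace and stack (my substitution, so the handling of None is spelled out); the names encoded and decoded are hypothetical.

```python
import pandas as pd

# Made-up frame (not the asker's data), with one missing cell.
df = pd.DataFrame(
    [["BSO", "PRV", "HEX", None],
     ["BSO", "HLF", "HLR", "HEX"]]
)

# Distinct non-null labels, in order of first appearance (row by row).
labels = pd.Series(df.to_numpy().ravel()).dropna().unique()

num2label = dict(enumerate(labels, 1))            # 1 -> 'BSO', 2 -> 'PRV', ...
label2num = {v: k for k, v in num2label.items()}  # 'BSO' -> 1, ...

# Encode: labels missing from the mapping (here, None) become NaN, then 0.
encoded = df.apply(lambda col: col.map(label2num)).fillna(0).astype(int)

# Decode: 0 is not a key of num2label, so those cells come back as NaN.
decoded = encoded.apply(lambda col: col.map(num2label))

print(encoded)
print(decoded)
```

Reserving 0 for missing cells keeps it distinct from every real label, so the decoding step can tell the two apart.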
Answered By: Corralien