Representing text data as numerical data

Question:

col_1 col_2
struct_1 AA
struct_bent_22 AA
sound_1 BB
sound_type_1 BB

I want to represent col_1 as numbers in a way that retains the variability of the text. Any suggestions on how to do this? If this is not a good idea, any alternative suggestions would be helpful.

Expected output:

col_1 col_2
123 AA
1233 AA
12345 BB
123456 BB

Those numbers obviously do not capture what the text means, but I need a solution that captures the features of the text, if possible.

Asked By: FalloutATS21


Answers:

While hashing always involves the risk of collision, I might look at:

import hashlib

def str_to_int(text):
    return int(hashlib.sha256(text.encode("utf-8")).hexdigest(), 16)

print(str_to_int("struct_1"))
print(str_to_int("struct_bent_22"))

and see if that works for you.

It should give you:

67086348026381773031976814245037562760713897031875549849861239415057698757195
24493846661393758293236341956602133508998612831345007873168413426300746343739
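
Those integers are 256 bits wide, which is more than a typical numeric column can hold. As a rough sketch, assuming a shorter code is acceptable, the digest can be cut down to its top 63 bits so it fits in a signed 64-bit integer. The `bits` parameter and the `codes` dict here are illustrative additions, not part of the function above, and a shorter code raises the collision risk. Note also that a hash only gives each distinct string a stable number; similar strings do not get similar numbers.

```python
import hashlib

def str_to_int(text, bits=63):
    """Hash text with SHA-256 and keep only the top `bits` bits (assumed width)."""
    digest = hashlib.sha256(text.encode("utf-8")).digest()
    # Interpret the 32-byte digest as a big-endian integer, then drop the low bits.
    return int.from_bytes(digest, "big") >> (256 - bits)

# Sample values taken from col_1 in the question.
col_1 = ["struct_1", "struct_bent_22", "sound_1", "sound_type_1"]

# Map each string to its shortened code; the same string always maps to the same code.
codes = {text: str_to_int(text) for text in col_1}

for text, code in codes.items():
    print(text, code)
```

With `bits=256` the function returns the same value as the original `int(hexdigest(), 16)` version.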
Answered By: JonSG