Probability of substring within string – this is being performed in python

Question:

I have a script that matches IDs in DF1 to IDs in DF2. In some cases the match is made by finding a string from DF1 as a substring of a DF2 ID.

For example:
DF1 ID: 2abc1
DF2 ID: 32abc13d

The match in the above case is 2abc1 (the string common to both IDs).
Although this matching method works well for the majority of my data, there are some cases in which the common match contains only 2-3 characters.

For example:
DF1 ID: 10abe
DF2 ID: 3210c13d

The longest common match here is "10".

What I want to figure out is the probability that such a match is incorrect.
I’m doing all my calculations in Python, so I’m hoping there might be a library for this.

Thanks

Asked By: PythonBeginner


Answers:

I’m not sure I follow your question, so this may not be exactly what you are looking for.
As you mention in the comment, we can use the fuzzywuzzy package. If you want to learn more, here is a link:

https://towardsdatascience.com/string-comparison-is-easy-with-fuzzywuzzy-library-611cc1888d97

FuzzyWuzzy works with Levenshtein distance:
https://en.wikipedia.org/wiki/Levenshtein_distance
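
To make the idea concrete, here is a minimal pure-Python sketch of the Levenshtein distance (the number of single-character insertions, deletions, and substitutions needed to turn one string into another). The function name levenshtein is only illustrative, and this is not FuzzyWuzzy's own code: the library's scorers combine several ratios, so its 0-100 scores will not equal this raw distance.

```python
def levenshtein(a: str, b: str) -> int:
    """Return the minimum number of single-character edits turning a into b."""
    # previous[j] is the distance between the prefix of a processed so far and b[:j].
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to the empty string
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute (or keep if equal)
            ))
        previous = current
    return previous[-1]


# The IDs from the question: three characters must be added to "2abc1".
print(levenshtein("2abc1", "32abc13d"))  # 3
```

A small distance relative to the string lengths suggests a close match; a library turns this kind of measure into a normalized score.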

In your case you will need to do something similar to this.

First I generated some data, since I am not sure what your entries look like. In the example, the keys are the ID strings and the values are their indices:

DF1_ID = {"2abc1": 1, "qsdf5": 2, "df5": 3, "qdqsdf5": 4, "13dab": 5}
DF2_ID = {"32abc13d": 1, "az9qsdf5": 2, "aqsdf5": 3, "3213dabc": 4}

FuzzyWuzzy is not complicated to use. I think something similar to this should work.

In the for loop, each key of DF1_ID is compared to all the keys of DF2_ID, and the best match is returned.

The variable out holds the best match and its similarity score, which ranges from 0 to 100 (a similarity measure rather than a true probability).
We can then set a minimum score for accepting a match; in our example it is 90.

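from fuzzywuzzy import process

mapper = {}
for val in DF1_ID.keys():
    # Best-scoring DF2 key for this DF1 key, as a (match, score) tuple
    out = process.extract(val, DF2_ID.keys(), limit=1)[0]
    print(val, ": Similarity -->", out)
    # Accept the match only if the score reaches the threshold
    if out[1] >= 90:
        mapper[val] = DF2_ID[out[0]]
    else:
        mapper[val] = None

print(mapper)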

Output:

2abc1 : Similarity --> ('32abc13d', 90)
qsdf5 : Similarity --> ('aqsdf5', 91)
df5 : Similarity --> ('az9qsdf5', 90)
qdqsdf5 : Similarity --> ('aqsdf5', 77)
13dab : Similarity --> ('3213dabc', 90)

{'2abc1': 1, 'qsdf5': 3, 'df5': 2, 'qdqsdf5': None, '13dab': 4}

The mapper dictionary maps each key of DF1_ID to the index of its best match in DF2_ID when the similarity score is at least 90; otherwise it maps the key to None.
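
If you would rather stay close to the substring matching described in the question, a rough alternative is to measure how long the longest common substring is and reject matches that are too short. The sketch below uses difflib from the standard library; the helper name longest_common_substring and the MIN_LENGTH threshold of 4 are assumptions for illustration, not values taken from your data. Note that this flags suspiciously short matches; it does not compute a probability.

```python
from difflib import SequenceMatcher


def longest_common_substring(a: str, b: str) -> str:
    """Return the longest contiguous block of characters shared by a and b."""
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    match = matcher.find_longest_match(0, len(a), 0, len(b))
    return a[match.a:match.a + match.size]


MIN_LENGTH = 4  # hypothetical threshold; tune it on IDs whose matches you trust

for df1_id, df2_id in [("2abc1", "32abc13d"), ("10abe", "3210c13d")]:
    common = longest_common_substring(df1_id, df2_id)
    coverage = len(common) / len(df1_id)  # share of the DF1 ID that matched
    verdict = "accept" if len(common) >= MIN_LENGTH else "suspicious"
    print(df1_id, df2_id, repr(common), round(coverage, 2), verdict)
```

For the two examples in the question, this finds "2abc1" (the whole DF1 ID) and "10" (only 2 of 5 characters), so the second pair falls below the threshold.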

Answered By: Lucas M. Uriarte