Get closest match for a column in a DataFrame

Question:

I have a DataFrame that contains different call types, with the values below:

    CallType
0         IN
1        OUT
2       a_in
3       asms
4   INCOMING
5   OUTGOING
6  A2P_SMSIN
7        ain
8       aout

I want to map these values so that the output is:

    CallType
0       IN
1       OUT
2       IN
3       SMS
4       IN
5       OUT
6       SMS
7       IN
8       OUT

I am trying to use difflib.get_close_matches, but it returns no match for most rows. Below is my code:

import difflib

import pandas as pd

CALL_TYPE = ['IN', 'OUT', 'SMS', 'VOICE', 'SMT']

def test1():
    final_file_data = pd.DataFrame({
        'CallType': ['IN', 'OUT', 'a_in',
                     'asms', 'INCOMING', 'OUTGOING', 'A2P_SMSIN',
                     'ain', 'aout']})

    print(final_file_data)
    final_file_data['CallType'] = final_file_data['CallType'].apply(lambda x: difflib.get_close_matches(x, CALL_TYPE, n=1))

The output I get is below, which has results only for IN and OUT:

 CallType
0     [IN]
1    [OUT]
2       []
3       []
4       []
5       []
6       []
7       []
8       []

I am not sure where I am going wrong.

Asked By: arpit joshi

||

Answers:

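There are two causes. get_close_matches is case-sensitive, so lowercase inputs such as 'a_in' share no characters with the uppercase entries in CALL_TYPE. In addition, the default similarity cutoff of 0.6 rejects weaker matches such as 'INCOMING' against 'IN'. You can convert x with upper() and lower the cutoff to be less stringent. Keep in mind that cutoff=0 returns a best match for every input, however dissimilar it is. This is what I did: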

final_file_data['CallType'] = final_file_data['CallType'].apply(lambda x: difflib.get_close_matches(x.upper(), CALL_TYPE, n=1, cutoff=0))

final_file_data is now:

  CallType
0     [IN]
1    [OUT]
2     [IN]
3    [SMS]
4     [IN]
5    [OUT]
6    [SMS]
7     [IN]
8    [OUT]
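
Each cell above is still a one-element list rather than the plain string the question asks for. A minimal sketch of one way to unwrap it follows. The helper name closest_call_type and the fallback of returning the original value when nothing matches are assumptions for illustration, not part of difflib or pandas. It runs on a plain list here so that it needs only the standard library.

```python
import difflib

CALL_TYPE = ['IN', 'OUT', 'SMS', 'VOICE', 'SMT']


def closest_call_type(value, choices=CALL_TYPE, cutoff=0.0):
    """Hypothetical helper: return the best match as a plain string."""
    matches = difflib.get_close_matches(value.upper(), choices, n=1, cutoff=cutoff)
    # get_close_matches returns a list; fall back to the input if it is empty.
    return matches[0] if matches else value


raw_values = ['IN', 'OUT', 'a_in', 'asms', 'INCOMING', 'OUTGOING',
              'A2P_SMSIN', 'ain', 'aout']
mapped = [closest_call_type(v) for v in raw_values]
print(mapped)
```

With a DataFrame, the same helper should be usable as final_file_data['CallType'].apply(closest_call_type). Raising cutoff above 0 makes the fallback branch reachable, which can be useful for spotting values that match nothing well.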

You can read more about get_close_matches, including the cutoff argument, in the difflib documentation.
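
To see how case and cutoff interact, the similarity scores can be inspected directly. The sketch below uses difflib.SequenceMatcher.ratio(), which is the score get_close_matches compares against cutoff. The helper name score is hypothetical and only there for illustration.

```python
import difflib

CALL_TYPE = ['IN', 'OUT', 'SMS', 'VOICE', 'SMT']


def score(word, candidate):
    """Hypothetical helper: similarity in [0, 1], as compared against cutoff."""
    # Mirror get_close_matches: the candidate is seq1 and the word is seq2.
    return difflib.SequenceMatcher(None, candidate, word).ratio()


for word in ['a_in', 'A_IN', 'INCOMING', 'OUTGOING']:
    # Pick the highest-scoring candidate and show its score.
    best = max(CALL_TYPE, key=lambda c: score(word, c))
    print(word, best, round(score(word, best), 3))
```

The lowercase 'a_in' scores 0 against every candidate, which is why upper() is needed. After upper-casing, 'A_IN' scores about 0.67 against 'IN' and would pass the default cutoff of 0.6, whereas 'INCOMING' scores 0.4 against 'IN' and 'OUTGOING' about 0.55 against 'OUT', so those rows only match once the cutoff is lowered.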

Answered By: Marcelo Paco