Get the closest match for a column in a DataFrame
Question:
I have a DataFrame that contains different call types, with the values below:
CallType
0 IN
1 OUT
2 a_in
3 asms
4 INCOMING
5 OUTGOING
6 A2P_SMSIN
7 ain
8 aout
I want to map these values so that the output is:
CallType
0 IN
1 OUT
2 IN
3 SMS
4 IN
5 OUT
6 SMS
7 IN
8 OUT
I am trying to use difflib.get_close_matches, but it returns no result for most rows. Below is my code:
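import difflib
import pandas as pd

# Canonical call types that every raw value should be mapped onto
CALL_TYPE = ['IN', 'OUT', 'SMS', 'VOICE', 'SMT']

def test1():
    final_file_data = pd.DataFrame({
        'CallType': ['IN', 'OUT', 'a_in',
                     'asms', 'INCOMING', 'OUTGOING', 'A2P_SMSIN',
                     'ain', 'aout']})
    print(final_file_data)
    # Default cutoff (0.6) and case-sensitive comparison: most rows come back as []
    final_file_data['CallType'] = final_file_data['CallType'].apply(lambda x: difflib.get_close_matches(x, CALL_TYPE, n=1))
    print(final_file_data)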
The output I get is below; it has results only for IN and OUT:
CallType
0 [IN]
1 [OUT]
2 []
3 []
4 []
5 []
6 []
7 []
8 []
I am not sure where I am going wrong.
Answers:
This happens because get_close_matches is case-sensitive and because of its cutoff, the minimum similarity score a candidate must reach to count as a match (0.6 by default). You can convert the string x to upper case with upper() and lower the cutoff to make the matching less strict. This is what I did:
final_file_data['CallType'] = final_file_data['CallType'].apply(lambda x: difflib.get_close_matches(x.upper(), CALL_TYPE, n=1, cutoff=0))
final_file_data is now:
CallType
0 [IN]
1 [OUT]
2 [IN]
3 [SMS]
4 [IN]
5 [OUT]
6 [SMS]
7 [IN]
8 [OUT]
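Note that each cell still holds a one-element list rather than a plain string. A minimal sketch of one way to unwrap it, assuming a hypothetical helper closest_call_type (not part of difflib or pandas) and a fallback value of your choosing for inputs that have no match:

```python
import difflib

CALL_TYPE = ['IN', 'OUT', 'SMS', 'VOICE', 'SMT']

def closest_call_type(value, choices=CALL_TYPE, cutoff=0.0, default=None):
    """Hypothetical helper: return the best match as a string, or `default` if none."""
    # Upper-case the input because the comparison is case-sensitive
    matches = difflib.get_close_matches(value.upper(), choices, n=1, cutoff=cutoff)
    # get_close_matches returns a list; take its only element when present
    return matches[0] if matches else default

raw = ['IN', 'OUT', 'a_in', 'asms', 'INCOMING', 'OUTGOING', 'A2P_SMSIN', 'ain', 'aout']
mapped = [closest_call_type(v) for v in raw]
print(mapped)
# ['IN', 'OUT', 'IN', 'SMS', 'IN', 'OUT', 'SMS', 'IN', 'OUT']
```

On the DataFrame from the question this could be applied as final_file_data['CallType'].apply(closest_call_type). Keep in mind that with cutoff=0 every input gets some match, even a completely unrelated string, so a slightly higher cutoff together with a default such as 'UNKNOWN' may be safer for real data.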
You can read more about get_close_matches and its cutoff argument in the Python difflib documentation.
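To see why the default cutoff rejects most of these values, it can help to print the similarity scores directly. The sketch below uses difflib.SequenceMatcher.ratio() on a few hand-picked pairs; the pairs list and the scores dictionary are illustrative names, not part of the original code.

```python
import difflib

# (raw value, intended target) pairs chosen for illustration
pairs = [
    ('a_in', 'IN'),        # lower case: no characters match at all
    ('A_IN', 'IN'),
    ('INCOMING', 'IN'),
    ('OUTGOING', 'OUT'),
    ('A2P_SMSIN', 'SMS'),
]

scores = {}
for word, target in pairs:
    # ratio() is 2 * matched_characters / total_characters_in_both_strings
    scores[word] = round(difflib.SequenceMatcher(None, target, word).ratio(), 3)

for word, score in scores.items():
    print(f"{word:>10}  {score}")
# a_in 0.0, A_IN 0.667, INCOMING 0.4, OUTGOING 0.545, A2P_SMSIN 0.5
```

Only A_IN clears the default cutoff of 0.6 here, so upper-casing alone is not enough: the long values such as INCOMING need a cutoff of 0.4 or lower to match their short targets.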