How to use a dictionary to standardize terminology across lists
Question:
I have been trying to solve this issue for a while, but I can’t seem to think of the right solution.
Basically, I am parsing a few PDFs, and the terminology differs depending on the source of each PDF. For example, source A1 writes ‘Batman’ as ‘The Batman’, while source B2 writes it as ‘bat man’.
So I tried creating a dictionary:
Voc_dict = {'Batman':'Batman',
'the Batman': 'Batman',
'bat man': 'Batman'}
Assume this dictionary extends to other superhero names.
I am trying to use it to standardize the following 2D list:
Super_list = [['among the heros with daddy issues, the bat man shines'], ['Bat man protects the city with everything he gots']]
You get the picture.
Apologies for the format and the stupid example. I can’t find a more relatable one, and it is my first time using the mobile app.
Thanks, guys.
What I did is the following:
Loop through the list and loop through the dictionary:
for i in super_list:
    for key, value in voc_dict.items():
        i.replace(voc_dict[key], voc_dict[value])
Answers:
What I did is the following: Loop through the list and loop through dictionary.
for i in super_list:
    for key, value in voc_dict.items():
        i.replace(voc_dict[key], voc_dict[value])
I would expect there to be at least three issues with this:
- You mentioned that super_list is a nested list, so you also need a nested for-loop to traverse it. Also, i is just a list (not a string) and does not have a .replace method, so i.replace would raise an AttributeError.
- As TimRoberts commented, .replace is not an in-place method, so you would need something like i = i.replace... to change i (if i were a string).
- Even if i were a string, there would be no point in using i = i.replace..., because that only rebinds the loop variable i and leaves the item stored in the list unchanged. Generally, you should use enumerate if you want to loop through and edit a list.
for si, sub_list in enumerate(super_list):
    for i, sl_item in enumerate(sub_list):
        for k, kw in Voc_dict.items():
            # reassign sl_item so each replacement builds on the previous one
            sl_item = sl_item.replace(k, kw)
        super_list[si][i] = sl_item
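To make the three issues listed above concrete, here is a small sketch that reproduces each of them. The shortened sample strings and the names raised, text and result are only for illustration and are not part of your code.

```python
# Shortened sample data (hypothetical, for illustration only)
super_list = [['the bat man shines'], ['Bat man protects the city']]

# Issue 1: each item of super_list is itself a list, and lists have no .replace
try:
    super_list[0].replace('bat man', 'Batman')
    raised = False
except AttributeError:
    raised = True

# Issue 2: str.replace returns a new string and leaves the original untouched
text = super_list[0][0]
result = text.replace('bat man', 'Batman')

# Issue 3: assigning to the loop variable only rebinds that name,
# so the strings stored in the list stay as they were
for sub_list in super_list:
    for item in sub_list:
        item = item.replace('bat man', 'Batman')

print(raised)
print(text)
print(result)
print(super_list)
```

After this runs, super_list still holds the original strings, which is why the loop has to assign back through the indexes.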
However, if you run that enumerate loop on your sample super_list, you might notice that only the first item gets altered, because str.replace is case-sensitive and 'Bat man' does not match the key 'bat man'. So you need to either add 'Bat man': 'Batman' to Voc_dict or use regex with re.IGNORECASE (after import re) by using re.sub(k, kw, sl_item, flags=re.I) instead of sl_item.replace(k, kw).
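As a rough sketch of the regex option, assuming the sample data from the question: the loop is the same enumerate loop, with re.sub swapped in for str.replace. Wrapping each key in re.escape is my addition, so that keys containing characters like . or ( are still treated as literal text.

```python
import re

Voc_dict = {'Batman': 'Batman',
            'the Batman': 'Batman',
            'bat man': 'Batman'}
super_list = [['among the heros with daddy issues, the bat man shines'],
              ['Bat man protects the city with everything he gots']]

for si, sub_list in enumerate(super_list):
    for i, _ in enumerate(sub_list):
        for k, kw in Voc_dict.items():
            # re.escape (an addition, not in the original answer) makes the key
            # match literally; flags=re.I makes the match case-insensitive
            super_list[si][i] = re.sub(re.escape(k), kw, super_list[si][i], flags=re.I)

print(super_list)
```

With this, both 'bat man' and 'Bat man' become 'Batman'. Note that the first sentence still reads 'the Batman shines', since 'the bat man' as a whole is not a key in Voc_dict.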
If you use regex, you can reduce the number of iterations by first reducing Voc_dict to something like {'(Batman|the Batman|bat man)': 'Batman'} with
Voc_dict = {'('+'|'.join([
    k for k,v in Voc_dict.items() if v==kw
])+')':kw for kw in set(Voc_dict.values())}
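If the dictionary grows to many superheroes, one possible variation (a sketch, not the only way to do it) is to compile a single pattern for all keys and look up the replacement in a callback, so each string is scanned once. The names lookup, variants, standardize and clean_list are hypothetical. Sorting the longest variants first is an assumption on my part, so that 'the Batman' is tried before 'Batman'.

```python
import re

Voc_dict = {'Batman': 'Batman',
            'the Batman': 'Batman',
            'bat man': 'Batman'}
super_list = [['among the heros with daddy issues, the bat man shines'],
              ['Bat man protects the city with everything he gots']]

# Lower-cased lookup table so a match in any casing maps to its standard form
lookup = {k.lower(): v for k, v in Voc_dict.items()}

# One alternation over all keys, longest first, each escaped to match literally
variants = sorted(Voc_dict, key=len, reverse=True)
pattern = re.compile('|'.join(re.escape(v) for v in variants), flags=re.I)

def standardize(text):
    """Replace every known variant in text with its standard form."""
    return pattern.sub(lambda m: lookup[m.group(0).lower()], text)

# Build a new nested list instead of editing super_list in place
clean_list = [[standardize(item) for item in sub_list] for sub_list in super_list]
print(clean_list)
```

This builds a new list rather than mutating the old one, which avoids the index bookkeeping. If you need the change in place, the same standardize function can be called inside the enumerate loop.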