How to use a dictionary to standardize terminology across lists

Question:

I have been trying to solve this issue for a while, but I can’t seem to think of the right solution.

Basically, I am parsing a few PDFs and, depending on the source of the PDF, the terminology used is different. For example, source A1 writes ‘Batman’ as ‘The Batman’. Source B2 writes it as ‘bat man’.

So what I tried to do is create a dictionary:

Voc_dict = {'Batman':'Batman',
'the Batman': 'Batman',
'bat man': 'Batman'}

Assume this dictionary extends to other superhero names.

So, I am trying to standardize the following 2d list:

Super_list  = [['among the heros with daddy issues, the bat man shines'], ['Bat man protects the city with everything he gots']]

You get the picture.

Apologies for the format and the stupid example. I couldn’t find a more relatable one, and it is my first time using the mobile app.

Thanks, guys.

What I did is the following: loop through the list and loop through the dictionary.

for i in super_list:
    for key, value in voc_dict.items():
         i.replace(voc_dict[key], voc_dict[value])
Asked By: H. H.

||

Answers:

What I did is the following: Loop through the list and loop through dictionary.

for i in super_list:
    for key, value in voc_dict.items():
         i.replace(voc_dict[key], voc_dict[value])

I would expect there to be at least three issues with this:

  1. You mentioned that super_list is a nested list, so you also need a nested for-loop to traverse it. In your loop, i is a list [not a string] and has no .replace method, so i.replace would raise an AttributeError.
  2. As TimRoberts commented, .replace is not an in-place method [strings are immutable, so it returns a new string], so you would need something like i = i.replace... to change i [if i was a string].
  3. Even if i was a string, there would be no point in using i = i.replace... because that only rebinds the loop variable i and leaves the item stored in the list unchanged. Generally, you should use enumerate if you want to loop through and edit a list:

for si, sub_list in enumerate(super_list):
    for i, sl_item in enumerate(sub_list):
        for k, kw in Voc_dict.items():
            # keep the result of each replacement so later keys build on it
            sl_item = sl_item.replace(k, kw)
        super_list[si][i] = sl_item

However, if you try the above code on your sample super_list, you might notice that only the first item gets altered, because str.replace is case-sensitive and 'Bat man' matches none of the keys. So you need to either add 'Bat man': 'Batman' to Voc_dict or use regex with re.IGNORECASE [after import re], i.e. re.sub(k, kw, sl_item, flags=re.I) instead of sl_item.replace(k, kw).
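As a rough sketch of the regex option, assuming the sample Voc_dict and super_list from the question: the loop below swaps str.replace for re.sub with re.IGNORECASE. The re.escape call is an addition that is not in the original suggestion; it is only there so that a key containing regex metacharacters would still be matched literally.

```python
import re

# Sample data from the question.
Voc_dict = {
    "Batman": "Batman",
    "the Batman": "Batman",
    "bat man": "Batman",
}
super_list = [
    ["among the heros with daddy issues, the bat man shines"],
    ["Bat man protects the city with everything he gots"],
]

for si, sub_list in enumerate(super_list):
    for i, sl_item in enumerate(sub_list):
        for k, kw in Voc_dict.items():
            # Case-insensitive, literal match of each variant.
            sl_item = re.sub(re.escape(k), kw, sl_item, flags=re.IGNORECASE)
        super_list[si][i] = sl_item

print(super_list)
```

With this data both items change. Note that the first one ends up as "... the Batman shines" rather than "... Batman shines", because the 'the Batman' key is tried before 'bat man' has been turned into 'Batman'; the result depends on the order of the keys.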

If you use regex, you can reduce the number of iterations by first reducing Voc_dict to something like {'(Batman|the Batman|bat man)': 'Batman'} with

Voc_dict = {'('+'|'.join([
    k for k,v in Voc_dict.items() if v==kw
])+')':kw for kw in set(Voc_dict.values())}
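
To show how the reduced dictionary might be applied end to end, here is a minimal sketch under the same sample data. The names patterns and standardize are hypothetical helpers, not part of the original answer, and the re.escape call and the longest-first ordering of variants are extra precautions I am assuming are wanted.

```python
import re

# Sample data from the question.
Voc_dict = {
    "Batman": "Batman",
    "the Batman": "Batman",
    "bat man": "Batman",
}
super_list = [
    ["among the heros with daddy issues, the bat man shines"],
    ["Bat man protects the city with everything he gots"],
]

# One compiled pattern per canonical name (hypothetical helper dict).
patterns = {}
for kw in set(Voc_dict.values()):
    # Longest variants first, so a variant that is a prefix of a longer
    # one cannot win the alternation at the same position.
    variants = sorted(
        (k for k, v in Voc_dict.items() if v == kw), key=len, reverse=True
    )
    alternation = "|".join(re.escape(k) for k in variants)
    patterns[kw] = re.compile(alternation, flags=re.IGNORECASE)


def standardize(text):
    """Replace every known variant in text with its canonical name."""
    for kw, pattern in patterns.items():
        text = pattern.sub(kw, text)
    return text


# Build a new nested list instead of editing in place.
super_list = [[standardize(item) for item in sub_list] for sub_list in super_list]
print(super_list)
```

This runs one substitution per canonical name instead of one per variant, which is the reduction in iterations described above.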
Answered By: Driftr95