How do I convert stylized Unicode text to plain text in Python so that I can search for a specific word in web-scraping results?

Question:

I am trying to scrape text from Instagram and check whether certain keywords appear in a user's bio, but some users write in special fonts, so I cannot match the word. How can I remove the font or formatting from a text so that I can search for the word?

import re
test="             . "


x = re.findall(re.compile('past'), test)
if x:    
    print("TEXT FOUND")
else:
    print("TEXT NOT FOUND")

TEXT NOT FOUND

Another example:

import re
test="ғʀᴇᴇʟᴀɴᴄᴇ ɢʀᴀᴘʜɪᴄ ᴅᴇsɪɢɴᴇʀ"
test=test.lower()

x = re.findall(re.compile('graphic'), test)
if x:    
    print("TEXT FOUND")
else:
    print("TEXT NOT FOUND")

TEXT NOT FOUND

Asked By: jguy

||

Answers:

You can use unicodedata.normalize, which returns the normal form of a Unicode string. For your first example, see the following code snippet:

import re
import unicodedata

test="             . "
 
formatted_test = unicodedata.normalize('NFKD', test).encode('ascii', 'ignore').decode('utf-8')

x = re.findall(re.compile('past'), formatted_test)
if x:    
    print("TEXT FOUND")
else:
    print("TEXT NOT FOUND")

and the output will be:

TEXT FOUND

Answered By: Javad

Problem 1:

Take care if you are dealing with texts in Portuguese.
If you have:

string = """  orçamento"""

And you use:

unicodedata.normalize('NFKD', string).encode('ascii', 'ignore').decode('utf-8')

You will lose the cedilla (ç); that is, orçamento becomes orcamento.

If you use this instead:

unicodedata.normalize('NFKC', string)

You will keep the cedilla.

Note that I changed NFKD to NFKC and also dropped the encode and decode calls.
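
To make the difference concrete, here is a minimal sketch that uses only the standard library. The fullwidth letters in `styled` are an assumption chosen for illustration; they are not the characters from the original bio.

```python
import unicodedata

word = "orçamento"
styled = "ｏｒçａｍｅｎｔｏ"  # fullwidth letters, chosen here only for illustration

# NFKD splits "ç" into "c" + COMBINING CEDILLA; encoding to ASCII with
# errors="ignore" then silently drops the combining mark.
ascii_only = unicodedata.normalize("NFKD", word).encode("ascii", "ignore").decode("ascii")

# NFKC applies the same compatibility mapping but recomposes afterwards,
# so the stylized letters become plain ones and "ç" survives.
composed = unicodedata.normalize("NFKC", styled)

print(ascii_only)  # orcamento
print(composed)    # orçamento
```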

Problem 2:

Take these examples (real ones that I found on Instagram):

    string2 = """ᴍᴇᴜ ᴄᴏʀᴀçãᴏ ᴀᴛé ᴘᴜʟᴏᴜ ǫᴜᴀɴᴅᴏ ᴇʟᴀ ᴘᴀssᴏᴜ, ᴍᴀs ᴏ ǫᴜᴇ ғᴇᴢ ᴇʟᴇ ᴘᴀʀᴀʀ ғᴏɪ sᴇᴜ ᴀʙʀᴀçᴏ"""
    string3 = """ """
    string4 = """(n̶ã̶o̶ ̶u̶s̶e̶ ̶á̶g̶u̶a̶ ̶d̶o̶c̶e̶!̶)"""

The unicodedata library is not able to normalize them.

Note that string2 looks "normal", but it is written with LATIN LETTER SMALL CAPITAL characters instead of ordinary Latin letters. In addition, the letter that looks like an F is not an F at all: it is CYRILLIC SMALL LETTER GHE WITH STROKE.
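
A quick way to check what a suspicious character really is, sketched with the standard library only. The three characters in `sample` are ones that appear in string2.

```python
import unicodedata

sample = "ғʀᴇ"  # characters that appear in string2

for ch in sample:
    # unicodedata.name() returns the official Unicode character name.
    print(f"U+{ord(ch):04X}  {unicodedata.name(ch)}")

# These characters have no compatibility decomposition, so NFKC
# returns the string unchanged.
unchanged = unicodedata.normalize("NFKC", sample) == sample
print(unchanged)  # True
```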

One alternative is Unidecode (https://pypi.org/project/Unidecode/):

from unidecode import unidecode

print(unidecode(string2))
MEU CORAcaO ATe PULOU oUAnDO ELA PAssOU, MAs O oUE g'EZ ELE PARAR g'OI sEU ABRAcO

print(unidecode(string3))
(C)(U)(I)(D)(A)(D)(O)

print(unidecode(string4))
nao use agua doce!

But Unidecode transliterates everything to ASCII, so we are back to Problem 1.
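
Since the goal is only to find a keyword, one possible way around this trade-off is to fold both the keyword and the text for the comparison and leave the scraped text itself untouched. The following is a rough sketch using only the standard library, not a complete solution. The helper names `fold` and `contains_keyword` and the sample `bio` are made up for illustration, and this approach does not help with small-capital letters like those in string2, which have no Unicode decomposition.

```python
import unicodedata

def fold(text):
    """Return a lowercase, accent-free copy of text, meant for matching only."""
    # NFKD maps compatibility characters to plain ones and splits accents
    # into separate combining marks.
    decomposed = unicodedata.normalize("NFKD", text)
    # Drop the combining marks (cedilla, tilde, strikethrough overlay, ...).
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.casefold()

def contains_keyword(text, keyword):
    """Check whether keyword occurs in text, ignoring case and accents."""
    return fold(keyword) in fold(text)

bio = "Faço orçamento grátis"  # hypothetical bio text
print(contains_keyword(bio, "orcamento"))  # True
print(contains_keyword(bio, "ORÇAMENTO"))  # True
print(bio)  # the original text still has its accents
```

Because only the folded copies are compared, the accents in the stored text are never lost.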

Answered By: Heloisa Rocha