Encounter an issue while trying to remove unicode emojis from strings

Question

I am having trouble removing Unicode emojis from my strings. Here are some examples that I have seen in my data:

['\\ud83d\\ude0e', '\\ud83e\\udd20', '\\ud83e\\udd23', '\\ud83d\\udc4d', '\\ud83d\\ude43', '\\ud83d\\ude31', '\\ud83d\\ude14', '\\ud83d\\udcaa', '\\ud83d\\ude0e', '\\ud83d\\ude09', '\\ud83d\\ude09', '\\ud83d\\ude18', '\\ud83d\\ude01', '\\ud83d\\ude44', '\\ud83d\\ude17']

Note that these are only some examples, not the full set, and that they appear inside longer strings in my data.

Here is the function I used to try to remove them:

import re

def remove_emojis(data):
    emoji_pattern = re.compile(
        u"(\\ud83d[\\ude00-\\ude4f])|"  # emoticons
        u"(\\ud83c[\\udf00-\\uffff])|"  # symbols & pictographs (1 of 2)
        u"(\\ud83d[\\u0000-\\uddff])|"  # symbols & pictographs (2 of 2)
        u"(\\ud83d[\\ude80-\\udeff])|"  # transport & map symbols
        u"(\\ud83c[\\udde0-\\uddff])"  # flags (iOS)
        "+", flags=re.UNICODE)
    return re.sub(emoji_pattern, '', data)

If I use "Naja, gegen dich ist sie ein Waisenknabe \\ud83d\\ude02\\ud83d\\ude02\\ud83d\\ude02" as input, my output is "Naja, gegen dich ist sie ein Waisenknabe \\ude02\\ude02\\ude02". However, my desired output is "Naja, gegen dich ist sie ein Waisenknabe ".

What mistake am I making, and how can I fix it to get the desired result?

Asked By: bdorhan

Answer 1

Since your text does not contain the emoji chars themselves, but their escaped representations in hexadecimal notation (\uXXXX), you can use

data = re.sub(r'\s*(?:\\+u[a-fA-F0-9]{4})+', '', data)

Details:

\s* – zero or more whitespace chars
(?:\\+u[a-fA-F0-9]{4})+ – one or more sequences of
- \\+ – one or more backslashes
- u – a u char
- [a-fA-F0-9]{4} – four hex chars.

See the regex demo.
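As a quick check, here is a minimal sketch that runs the pattern on the input from the question. The names remove_escaped_emojis, remove_surrogate_escapes, ESCAPED_UNICODE and SURROGATE_PAIR are illustrative only and not part of the original answer. The second pattern is an assumed, narrower variant that only removes escaped surrogate pairs (a \ud800-\udbff escape followed by a \udc00-\udfff escape), which may be preferable if the data also contains escapes for ordinary characters such as \u00fc.

```python
import re

# The text holds the literal characters backslash, "u" and four hex digits,
# not the emoji itself, so the pattern matches that escaped form.
ESCAPED_UNICODE = re.compile(r'\s*(?:\\+u[a-fA-F0-9]{4})+')

# Assumed narrower variant: only escaped high/low surrogate pairs.
SURROGATE_PAIR = re.compile(
    r'\s*(?:\\+u[dD][89abAB][0-9a-fA-F]{2}'   # high surrogate d800-dbff
    r'\\+u[dD][c-fC-F][0-9a-fA-F]{2})+'       # low surrogate dc00-dfff
)

def remove_escaped_emojis(data):
    """Remove every escaped \\uXXXX sequence and the whitespace before it."""
    return ESCAPED_UNICODE.sub('', data)

def remove_surrogate_escapes(data):
    """Remove only escaped surrogate pairs and the whitespace before them."""
    return SURROGATE_PAIR.sub('', data)

text = "Naja, gegen dich ist sie ein Waisenknabe \\ud83d\\ude02\\ud83d\\ude02\\ud83d\\ude02"
print(repr(remove_escaped_emojis(text)))
# 'Naja, gegen dich ist sie ein Waisenknabe'

mixed = "Gr\\u00fc\\u00dfe \\ud83d\\ude02"
print(repr(remove_escaped_emojis(mixed)))     # 'Gre' (the umlaut escapes are removed too)
print(repr(remove_surrogate_escapes(mixed)))  # 'Gr\\u00fc\\u00dfe'
```

Because of the leading \s*, the space before the emojis is removed as well. To keep that trailing space, as in the desired output from the question, drop \s* from the pattern.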

Answered By: Wiktor Stribiżew
