Encountering an issue while trying to remove Unicode emojis from strings

Question:

I am having a problem removing Unicode emojis from my strings. Here are some examples that I've seen in my data:

['\\ud83d\\ude0e', '\\ud83e\\udd20', '\\ud83e\\udd23', '\\ud83d\\udc4d', '\\ud83d\\ude43', '\\ud83d\\ude31', '\\ud83d\\ude14', '\\ud83d\\udcaa', '\\ud83d\\ude0e', '\\ud83d\\ude09', '\\ud83d\\ude09', '\\ud83d\\ude18','\\ud83d\\ude01' , '\\ud83d\\ude44', '\\ud83d\\ude17']

Note that these are only a few examples, not the full set, and they actually appear inside longer strings in my data.

Here is the function I used to try to remove them:

import re

def remove_emojis(data):
    emoji_pattern = re.compile(
        u"(\\ud83d[\\ude00-\\ude4f])|"  # emoticons
        u"(\\ud83c[\\udf00-\\uffff])|"  # symbols & pictographs (1 of 2)
        u"(\\ud83d[\\u0000-\\uddff])|"  # symbols & pictographs (2 of 2)
        u"(\\ud83d[\\ude80-\\udeff])|"  # transport & map symbols
        u"(\\ud83c[\\udde0-\\uddff])"  # flags (iOS)
        "+", flags=re.UNICODE)
    return re.sub(emoji_pattern, '', data)

If I use "Naja, gegen dich ist sie ein Waisenknabe \\ud83d\\ude02\\ud83d\\ude02\\ud83d\\ude02" as an input, my output is "Naja, gegen dich ist sie ein Waisenknabe \\ude02\\ude02\\ude02". However, my desired output is "Naja, gegen dich ist sie ein Waisenknabe ".

What mistake am I making, and how can I fix it to get the desired result?

Asked By: bdorhan


Answers:

Since your text does not contain the emoji characters themselves, but their escaped representations in hexadecimal notation (\uXXXX), you can use

data = re.sub(r'\s*(?:\\+u[a-fA-F0-9]{4})+', '', data)

Details:

  • \s* – zero or more whitespace characters
  • (?:\\+u[a-fA-F0-9]{4})+ – one or more sequences of
    • \\+ – one or more backslashes
    • u – a u char
    • [a-fA-F0-9]{4} – four hex chars.
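
The pattern described above, with its backslashes written out in full (\s* for whitespace and \\+ for one or more literal backslashes), can be tried with a minimal sketch like the one below. The names ESCAPED_UNICODE and strip_escaped_emojis are hypothetical and chosen only for illustration, and the sketch assumes the data holds literal \uXXXX text rather than real emoji characters.

```python
import re

# Optional leading whitespace, then one or more literal escape sequences:
# one or more backslashes, the letter "u", and exactly four hex digits.
ESCAPED_UNICODE = re.compile(r'\s*(?:\\+u[a-fA-F0-9]{4})+')


def strip_escaped_emojis(text):
    """Remove literal \\uXXXX escape sequences (hypothetical helper)."""
    return ESCAPED_UNICODE.sub('', text)


# A raw string keeps the backslashes as literal characters,
# which mimics the escaped text found in the data.
sample = r"Naja, gegen dich ist sie ein Waisenknabe \ud83d\ude02\ud83d\ude02\ud83d\ude02"
print(repr(strip_escaped_emojis(sample)))
# → 'Naja, gegen dich ist sie ein Waisenknabe'
```

Because of the leading \s*, the whitespace in front of the escape sequences is removed as well, so the result has no trailing space. Note also that this pattern removes every \uXXXX escape, not only those that encode emojis, so it is only appropriate if the data contains no other escaped characters that should be kept.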


Answered By: Wiktor Stribiżew