Problem removing Unicode emojis from strings
Question:
I am having a problem removing Unicode emojis from my strings. Here are some examples that I have seen in my data:
['\\ud83d\\ude0e', '\\ud83e\\udd20', '\\ud83e\\udd23', '\\ud83d\\udc4d', '\\ud83d\\ude43', '\\ud83d\\ude31', '\\ud83d\\ude14', '\\ud83d\\udcaa', '\\ud83d\\ude0e', '\\ud83d\\ude09', '\\ud83d\\ude09', '\\ud83d\\ude18','\\ud83d\\ude01' , '\\ud83d\\ude44', '\\ud83d\\ude17']
Note that these are only some examples, not the full set, and that they appear inside longer strings in my data.
Here is the function I tried to remove them with:
import re

def remove_emojis(data):
    emoji_pattern = re.compile(
        u"(\\ud83d[\\ude00-\\ude4f])|"  # emoticons
        u"(\\ud83c[\\udf00-\\uffff])|"  # symbols & pictographs (1 of 2)
        u"(\\ud83d[\\u0000-\\uddff])|"  # symbols & pictographs (2 of 2)
        u"(\\ud83d[\\ude80-\\udeff])|"  # transport & map symbols
        u"(\\ud83c[\\udde0-\\uddff])"  # flags (iOS)
        "+", flags=re.UNICODE)
    return re.sub(emoji_pattern, '', data)
If I use "Naja, gegen dich ist sie ein Waisenknabe \\ud83d\\ude02\\ud83d\\ude02\\ud83d\\ude02" as input, my output is "Naja, gegen dich ist sie ein Waisenknabe \\ude02\\ude02\\ude02". However, my desired output is "Naja, gegen dich ist sie ein Waisenknabe ".
What mistake am I making, and how can I fix it to get the desired result?
Answers:
Since your text does not contain the emoji characters themselves, but their escaped representations in hexadecimal notation (\uXXXX), you can use
data = re.sub(r'\s*(?:\\+u[a-fA-F0-9]{4})+', '', data)
Details:
\s* – zero or more whitespace characters
(?:\\+u[a-fA-F0-9]{4})+ – one or more sequences of
    \\+ – one or more backslashes
    u – a u char
    [a-fA-F0-9]{4} – four hex chars.
See the regex demo.
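As a quick check, the pattern above can be wrapped in a small function and run against the sample sentence from the question. This is only a minimal sketch: the helper name remove_escaped_emojis and the constant ESCAPE_RUN are made up for illustration, and it assumes the data really holds literal backslash-u text rather than actual emoji characters.

```python
import re

# Optional leading whitespace, then one or more literal \uXXXX escapes:
# one or more backslashes, the letter "u", and four hex digits.
ESCAPE_RUN = re.compile(r'\s*(?:\\+u[a-fA-F0-9]{4})+')

def remove_escaped_emojis(data):
    # Hypothetical helper name; it simply applies the answer's regex.
    return ESCAPE_RUN.sub('', data)

# Raw string, so the sample contains literal backslash-u sequences.
sample = r"Naja, gegen dich ist sie ein Waisenknabe \ud83d\ude02\ud83d\ude02\ud83d\ude02"
cleaned = remove_escaped_emojis(sample)
print(repr(cleaned))  # 'Naja, gegen dich ist sie ein Waisenknabe'
```

Because of the leading \s*, the space before the escapes is removed as well; dropping \s* from the pattern would keep the trailing space shown in the desired output. Note also that the pattern removes every \uXXXX escape, not only emoji, so escaped non-emoji characters in the data would be stripped too.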