removing emojis from a string in Python
Question:
I found this code in Python for removing emojis, but it is not working. Can you help with other code or a fix for this?
I have observed that all my emojis start with \xf, but when I try to search with str.startswith("\xf") I get an invalid character error.
emoji_pattern = r'/[x{1F601}-x{1F64F}]/u'
re.sub(emoji_pattern, '', word)
Here’s the error:
Traceback (most recent call last):
  File "test.py", line 52, in <module>
    re.sub(emoji_pattern,'',word)
  File "/usr/lib/python2.7/re.py", line 151, in sub
    return _compile(pattern, flags).sub(repl, string, count)
  File "/usr/lib/python2.7/re.py", line 244, in _compile
    raise error, v # invalid expression
sre_constants.error: bad character range
Each of the items in a list can be a word ['This', 'dog', '\xf0\x9f\x98\x82', 'https://t.co/5N86jYipOI']
UPDATE:
I used this other code:
emoji_pattern = re.compile(ur"""
    [\U0001F600-\U0001F64F]  # emoticons
    |
    [\U0001F300-\U0001F5FF]  # symbols & pictographs
    |
    [\U0001F680-\U0001F6FF]  # transport & map symbols
    |
    [\U0001F1E0-\U0001F1FF]  # flags (iOS)
    """, re.VERBOSE)
emoji_pattern.sub('', word)
But this still doesn’t remove the emojis; they are still shown. Any clue why that is?
Answers:
Because [...] means any one of a set of characters, and because two characters in a group separated by a dash mean a range of characters (often “a-z” or “0-9”), your pattern says “a slash, followed by any character in the group containing x, {, 1, F, 6, 0, 1, the range } through x, {, 1, F, 6, 4, F or }, followed by a slash and the letter u”. That range in the middle is what re is calling the bad character range: } sorts after x, so the range runs backwards.
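The backwards range can be reproduced in isolation. This is a minimal sketch for Python 3; the helper name compile_error is made up for illustration, and the exact wording of the message may differ between Python versions:

```python
import re

pattern = r'/[x{1F601}-x{1F64F}]/u'

def compile_error(p):
    """Return the re.error message for pattern p, or None if it compiles."""
    try:
        re.compile(p)
    except re.error as exc:
        return str(exc)
    return None

message = compile_error(pattern)
print(message)

# "}" is U+007D and "x" is U+0078, so the range "}-x" runs backwards.
print(ord('}') > ord('x'))
```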
On Python 2, you have to use a u'' literal to create a Unicode string. Also, you should pass the re.UNICODE flag and convert your input data to Unicode (e.g., text = data.decode('utf-8')):
#!/usr/bin/env python
import re
text = u'This dog \U0001f602'
print(text) # with emoji
emoji_pattern = re.compile("["
        u"\U0001F600-\U0001F64F"  # emoticons
        u"\U0001F300-\U0001F5FF"  # symbols & pictographs
        u"\U0001F680-\U0001F6FF"  # transport & map symbols
        u"\U0001F1E0-\U0001F1FF"  # flags (iOS)
                           "]+", flags=re.UNICODE)
print(emoji_pattern.sub(r'', text)) # no emoji
Output
This dog 😂
This dog
Note: emoji_pattern matches only some emoji (not all). See Which Characters are Emoji.
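For comparison, here is a sketch of the same idea on Python 3, where every str is already Unicode, so neither the u'' prefix nor re.UNICODE is needed. It assumes the input arrives as UTF-8 bytes, like the emoji item in the question's word list; the names data, words and cleaned are illustrative only:

```python
import re

# Same four ranges as above; on Python 3 a plain str literal handles \U escapes.
emoji_pattern = re.compile(
    "["
    "\U0001F600-\U0001F64F"  # emoticons
    "\U0001F300-\U0001F5FF"  # symbols & pictographs
    "\U0001F680-\U0001F6FF"  # transport & map symbols
    "\U0001F1E0-\U0001F1FF"  # flags (iOS)
    "]+"
)

# Raw UTF-8 bytes, as they appear in the question's word list.
data = [b"This", b"dog", b"\xf0\x9f\x98\x82"]

# Decode first, then substitute: the pattern works on code points, not bytes.
words = [item.decode("utf-8") for item in data]
cleaned = [emoji_pattern.sub("", word) for word in words]
print(cleaned)
```

Decoding before matching is the important step: the four bytes only become the single code point U+1F602 after decode('utf-8').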
If you’re using the example from the accepted answer and still getting “bad character range” errors, then you’re probably using a narrow build (see this answer for more details). A reformatted version of the regex that seems to work is:
emoji_pattern = re.compile(
    u"(\ud83d[\ude00-\ude4f])|"  # emoticons
    u"(\ud83c[\udf00-\uffff])|"  # symbols & pictographs (1 of 2)
    u"(\ud83d[\u0000-\uddff])|"  # symbols & pictographs (2 of 2)
    u"(\ud83d[\ude80-\udeff])|"  # transport & map symbols
    u"(\ud83c[\udde0-\uddff])"  # flags (iOS)
    "+", flags=re.UNICODE)
The accepted answer and others worked for me for a bit, but I ultimately decided to strip all characters outside of the Basic Multilingual Plane. This excludes future additions to other Unicode planes (where emoji and such live), which means I don’t have to update my code every time new Unicode characters are added :).
In Python 2.7, convert your text to unicode if it is not already, and then use the negated regex below (it substitutes anything not in the character class, which lists all characters from the BMP except for surrogates, which are used in pairs to encode Supplementary Multilingual Plane characters).
NON_BMP_RE = re.compile(u"[^\U00000000-\U0000d7ff\U0000e000-\U0000ffff]", flags=re.UNICODE)
NON_BMP_RE.sub(u'', unicode(text, 'utf-8'))
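On Python 3 the same idea might look like the sketch below (strip_non_bmp is an illustrative name, and the sample strings are made up). One limitation worth knowing: emoji that live inside the BMP, such as U+263A, are left in place.

```python
import re

# Negated class: everything except the BMP minus the surrogate block U+D800-U+DFFF.
NON_BMP_RE = re.compile("[^\u0000-\ud7ff\ue000-\uffff]")

def strip_non_bmp(text):
    """Drop every character outside the Basic Multilingual Plane."""
    return NON_BMP_RE.sub("", text)

print(repr(strip_non_bmp("This dog \U0001F602")))     # emoji above U+FFFF is removed
print(repr(strip_non_bmp("caf\u00e9 \u4e2d\u6587")))  # accented and CJK text is kept
print(repr(strip_non_bmp("smile \u263a")))            # BMP emoji is kept
```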
Complete version of removing emojis:
import re
def remove_emoji(string):
    emoji_pattern = re.compile("["
        u"\U0001F600-\U0001F64F"  # emoticons
        u"\U0001F300-\U0001F5FF"  # symbols & pictographs
        u"\U0001F680-\U0001F6FF"  # transport & map symbols
        u"\U0001F1E0-\U0001F1FF"  # flags (iOS)
        u"\U00002702-\U000027B0"
        u"\U000024C2-\U0001F251"
                           "]+", flags=re.UNICODE)
    return emoji_pattern.sub(r'', string)
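One caveat with the last range: U+24C2 through U+1F251 is very wide and covers far more than emoji, including the CJK Unified Ideographs and Hangul Syllables blocks. A quick check in Python 3 that applies that single range on its own (the sample strings are made up):

```python
import re

# Only the widest range from the pattern above, isolated for the experiment.
wide_range = re.compile("[\U000024C2-\U0001F251]+")

samples = {
    "latin": "hello",
    "cjk": "\u4e2d\u6587",     # Chinese characters
    "hangul": "\ud55c\uae00",  # Korean characters
}

results = {name: wide_range.sub("", text) for name, text in samples.items()}
for name, cleaned in results.items():
    print(name, repr(cleaned))
```

If the text may contain such scripts, narrower ranges are the safer choice.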
Tried all the answers; unfortunately, they didn’t remove the new hugging face emoji, the clinking glasses emoji, and a lot more.
Ended up with a list of all possible emoji, taken from the python emoji package on GitHub, and I had to create a gist because there’s a 30k character limit on Stack Overflow answers and it’s over 70k characters.
I am updating my answer to match the one by @jfs because my previous answer failed to account for other Unicode scripts such as Latin, Greek etc. Stack Overflow doesn’t allow me to delete my previous answer, hence I am updating it to match the most acceptable answer to the question.
#!/usr/bin/env python
import re
text = u'This is a smiley face \U0001f602'
print(text) # with emoji
def deEmojify(text):
    regrex_pattern = re.compile(pattern = "["
        u"\U0001F600-\U0001F64F"  # emoticons
        u"\U0001F300-\U0001F5FF"  # symbols & pictographs
        u"\U0001F680-\U0001F6FF"  # transport & map symbols
        u"\U0001F1E0-\U0001F1FF"  # flags (iOS)
                           "]+", flags = re.UNICODE)
    return regrex_pattern.sub(r'',text)
print(deEmojify(text))
This was my previous answer, do not use this.
def deEmojify(inputString):
    return inputString.encode('ascii', 'ignore').decode('ascii')
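To see why the ASCII round trip is discouraged, note that it drops every non-ASCII character, not only emoji. A short Python 3 illustration (ascii_only and the sample text are made up for this sketch):

```python
def ascii_only(text):
    """Encode to ASCII, silently discarding anything that does not fit."""
    return text.encode("ascii", "ignore").decode("ascii")

print(repr(ascii_only("This dog \U0001F602")))   # emoji removed
print(repr(ascii_only("na\u00efve caf\u00e9")))  # accented letters removed as well
```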
If you are not keen on using regex, the best solution could be using the emoji python package.
Here is a simple function to return emoji free text (thanks to this SO answer):
import emoji
def give_emoji_free_text(text):
    allchars = [str for str in text.decode('utf-8')]
    emoji_list = [c for c in allchars if c in emoji.UNICODE_EMOJI]
    clean_text = ' '.join([str for str in text.decode('utf-8').split() if not any(i in str for i in emoji_list)])
    return clean_text
If you are dealing with strings containing emojis, this is straightforward
>> s1 = "Hi How is your and . Have a nice weekend "
>> print s1
Hi How is your and . Have a nice weekend
>> print give_emoji_free_text(s1)
Hi How is your and Have a nice weekend
If you are dealing with unicode (as in the example by @jfs), just encode it with utf-8.
>> s2 = u'This dog \U0001f602'
>> print s2
This dog 😂
>> print give_emoji_free_text(s2.encode('utf8'))
This dog
Edits
Based on the comment, it should be as easy as:
def give_emoji_free_text(text):
    return emoji.get_emoji_regexp().sub(r'', text.decode('utf8'))
This is my solution. It removes the additional man and woman emoji (♂ and ♀), which can’t be rendered by Python.
emoji_pattern = re.compile("["
        u"\U0001F600-\U0001F64F"  # emoticons
        u"\U0001F300-\U0001F5FF"  # symbols & pictographs
        u"\U0001F680-\U0001F6FF"  # transport & map symbols
        u"\U0001F1E0-\U0001F1FF"  # flags (iOS)
        u"\U00002702-\U000027B0"
        u"\U000024C2-\U0001F251"
        u"\U0001f926-\U0001f937"
        u"\u200d"
        u"\u2640-\u2642"
                           "]+", flags=re.UNICODE)
Converting the string into another character set like this might help:
text.encode('latin-1', 'ignore').decode('latin-1')
Kind regards.
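Note that the Latin-1 round trip keeps only characters in the range U+0000 to U+00FF, so text in other scripts is lost along with the emoji. A Python 3 illustration with made-up input (latin1_only is a hypothetical helper name):

```python
def latin1_only(text):
    """Keep only characters representable in Latin-1."""
    return text.encode("latin-1", "ignore").decode("latin-1")

print(repr(latin1_only("caf\u00e9 \U0001F602")))            # accent survives, emoji does not
print(repr(latin1_only("\u0394\u03b5\u03bb\u03c4\u03b1")))  # Greek is removed entirely
```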
I tried to collect the complete list of Unicode ranges.
I use it to extract emojis from tweets and it works very well for me.
# Emojis pattern
emoji_pattern = re.compile("["
        u"\U0001F600-\U0001F64F"  # emoticons
        u"\U0001F300-\U0001F5FF"  # symbols & pictographs
        u"\U0001F680-\U0001F6FF"  # transport & map symbols
        u"\U0001F1E0-\U0001F1FF"  # flags (iOS)
        u"\U00002702-\U000027B0"
        u"\U000024C2-\U0001F251"
        u"\U0001f926-\U0001f937"
        u'\U00010000-\U0010ffff'
        u"\u200d"
        u"\u2640-\u2642"
        u"\u2600-\u2B55"
        u"\u23cf"
        u"\u23e9"
        u"\u231a"
        u"\u3030"
        u"\ufe0f"
                           "]+", flags=re.UNICODE)
Here’s a Python 3 script that uses the emoji library’s get_emoji_regexp(), as suggested by kingmakerking and Martijn Pieters in their answer/comment.
It reads text from a file and writes the emoji-free text to another file.
import emoji
import re
def strip_emoji(text):
    print(emoji.emoji_count(text))
    new_text = re.sub(emoji.get_emoji_regexp(), r"", text)
    return new_text
with open("my_file.md", "r") as file:
    old_text = file.read()
no_emoji_text = strip_emoji(old_text)
with open("file.md", "w+") as new_file:
    new_file.write(no_emoji_text)
Complete Version of remove Emojis
✍
import re
def remove_emojis(data):
    emoj = re.compile("["
        u"\U0001F600-\U0001F64F"  # emoticons
        u"\U0001F300-\U0001F5FF"  # symbols & pictographs
        u"\U0001F680-\U0001F6FF"  # transport & map symbols
        u"\U0001F1E0-\U0001F1FF"  # flags (iOS)
        u"\U00002500-\U00002BEF"  # box drawing through miscellaneous symbols and arrows
        u"\U00002702-\U000027B0"  # dingbats
        u"\U000024C2-\U0001F251"
        u"\U0001f926-\U0001f937"
        u"\U00010000-\U0010ffff"
        u"\u2640-\u2642"
        u"\u2600-\u2B55"
        u"\u200d"
        u"\u23cf"
        u"\u23e9"
        u"\u231a"
        u"\ufe0f"  # variation selector-16
        u"\u3030"
                      "]+", re.UNICODE)
    return re.sub(emoj, '', data)
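Entries such as \u200d and \ufe0f are in the list because many emoji are sequences of several code points rather than a single character. A sketch that lists the code points of one such sequence, taken here to be the "man facepalming" sequence (the variable names are illustrative):

```python
import unicodedata

# Base emoji + zero width joiner + male sign + variation selector-16.
sequence = "\U0001F926\u200D\u2642\uFE0F"

code_points = [f"U+{ord(ch):04X}" for ch in sequence]
for ch, label in zip(sequence, code_points):
    print(label, unicodedata.name(ch, "<unnamed>"))
```

If only the base emoji is removed, the joiner, the gender sign and the variation selector stay behind as stray characters, which is why they appear in the character class.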
The best solution to this is to use the external library emoji. This library is continuously updated with the latest emojis and thus can be used to find them in any text. Unlike the ASCII decode method, which removes all non-ASCII characters, this method keeps them and only removes emojis.
- First install the emoji library if you don’t have it:
pip install emoji
- Next import it in your file/project:
import emoji
- Now to remove all emojis use the statement:
emoji.get_emoji_regexp().sub("", msg)
where msg is the text to be edited.
That’s all you need.
I know this may not be directly related to the question asked, but it is helpful in solving the parent problem, which is removing emojis from text. There is a Python module named demoji which does this task very accurately and removes almost all types of emojis. It is also updated regularly to provide up-to-date emoji removal support.
To remove emoji, demoji.replace(text, '') is used.
For me the following worked in Python 3.8 for substituting emojis:
import re
result = re.sub('[\U0001F600-\U0001F92F\U0001F300-\U0001F5FF\U0001F680-\U0001F6FF\U0001F190-\U0001F1FF\U00002702-\U000027B0\U0001F926-\U0001FA9F\u200d\u2640-\u2642\u2600-\u2B55\u23cf\u23e9\u231a\ufe0f]+','','A quick brown fox jumps over the lazy dog ')
It’s a much simplified version of the answers given here.
I tested this code for i18n support with English, Russian, Chinese and Japanese; only emojis were removed.
This is not an exhaustive list and may have missed some emojis, but it works for most of the common ones.
This is the easiest code to remove all emoji.
import emoji
def remove_emojis(text: str) -> str:
    return ''.join(c for c in text if c not in emoji.UNICODE_EMOJI)
I simply removed all the special characters using regex and this worked for me.
sent_0 = re.sub('[^A-Za-z0-9]+', ' ', sent_0)
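Be aware that this whitelist also removes punctuation and every non-ASCII letter, not just emoji. A small Python 3 example (the input sentence is made up):

```python
import re

sent_0 = "Caf\u00e9, d\u00e9j\u00e0 vu! \U0001F602"

# Every run of characters outside A-Z, a-z and 0-9 collapses to one space.
cleaned = re.sub("[^A-Za-z0-9]+", " ", sent_0)
print(repr(cleaned))
```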
For those still using Python 2.7, this regex might help:
(?:[\u2700-\u27bf]|(?:\ud83c[\udde6-\uddff]){2}|[\ud800-\udbff][\udc00-\udfff]|[\u0023-\u0039]\ufe0f?\u20e3|\u3299|\u3297|\u303d|\u3030|\u24c2|\ud83c[\udd70-\udd71]|\ud83c[\udd7e-\udd7f]|\ud83c\udd8e|\ud83c[\udd91-\udd9a]|\ud83c[\udde6-\uddff]|[\ud83c\ude01-\ude02]|\ud83c\ude1a|\ud83c\ude2f|[\ud83c\ude32-\ude3a]|[\ud83c\ude50-\ude51]|\u203c|\u2049|[\u25aa-\u25ab]|\u25b6|\u25c0|[\u25fb-\u25fe]|\u00a9|\u00ae|\u2122|\u2139|\ud83c\udc04|[\u2600-\u26FF]|\u2b05|\u2b06|\u2b07|\u2b1b|\u2b1c|\u2b50|\u2b55|\u231a|\u231b|\u2328|\u23cf|[\u23e9-\u23f3]|[\u23f8-\u23fa]|\ud83c\udccf|\u2934|\u2935|[\u2190-\u21ff])
So to use it in your code, it will look somewhat like this:
emoji_pattern = re.compile(
    u"(?:[\u2700-\u27bf]|(?:\ud83c[\udde6-\uddff]){2}|[\ud800-\udbff][\udc00-\udfff]|[\u0023-\u0039]\ufe0f?\u20e3|\u3299|\u3297|\u303d|\u3030|\u24c2|\ud83c[\udd70-\udd71]|\ud83c[\udd7e-\udd7f]|\ud83c\udd8e|\ud83c[\udd91-\udd9a]|\ud83c[\udde6-\uddff]|[\ud83c\ude01-\ude02]|\ud83c\ude1a|\ud83c\ude2f|[\ud83c\ude32-\ude3a]|[\ud83c\ude50-\ude51]|\u203c|\u2049|[\u25aa-\u25ab]|\u25b6|\u25c0|[\u25fb-\u25fe]|\u00a9|\u00ae|\u2122|\u2139|\ud83c\udc04|[\u2600-\u26FF]|\u2b05|\u2b06|\u2b07|\u2b1b|\u2b1c|\u2b50|\u2b55|\u231a|\u231b|\u2328|\u23cf|[\u23e9-\u23f3]|[\u23f8-\u23fa]|\ud83c\udccf|\u2934|\u2935|[\u2190-\u21ff])"
    "+", flags=re.UNICODE)
Why is this still needed when we actually don’t use Python 2.7 that much anymore these days? Some systems/Python implementations still use Python 2.7, like Python UDFs in Amazon Redshift.
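The \ud83c and \udde6 style escapes in this regex are UTF-16 surrogates: on a narrow Python 2 build, each character above U+FFFF is stored as a high/low pair, so the pattern has to match the two halves. A small sketch of the arithmetic, written for Python 3 (to_surrogate_pair is a hypothetical helper, not part of any library):

```python
def to_surrogate_pair(ch):
    """Split a code point above U+FFFF into its UTF-16 high and low surrogates."""
    offset = ord(ch) - 0x10000
    high = 0xD800 + (offset >> 10)   # top 10 bits of the offset
    low = 0xDC00 + (offset & 0x3FF)  # bottom 10 bits of the offset
    return high, low

# U+1F1E6 is the first regional indicator symbol, used for flags.
high, low = to_surrogate_pair("\U0001F1E6")
print(hex(high), hex(low))
```

This is why the flag ranges above are written as \ud83c followed by a class starting at \udde6.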
I was able to get rid of the emoji in the following way.
Install emoji:
https://pypi.org/project/emoji/
$ pip3 install emoji
import emoji
def remove_emoji(string):
    return emoji.get_emoji_regexp().sub(u'', string)
emojis = '(`ヘ´) ⭕ ⭐ ⏩'
print(remove_emoji(emojis))
## Output result
(`ヘ´)
Use the Demoji package,
https://pypi.org/project/demoji/
import demoji
text=" "
emoji_less_text = demoji.replace(text, "")
This does more than filtering out just emojis. It removes non-ASCII characters but tries to do that in a gentle way, replacing them with relevant ASCII characters where possible. It can be a blessing later on if your text contains, for example, only the regular ASCII apostrophe and quotation mark rather than a dozen different Unicode apostrophes and quotation marks (usually coming from Apple handhelds).
unicodedata.normalize("NFKD", sentence).encode("ascii", "ignore")
This is robust, I use it with some more guards:
import unicodedata
def neutralize_unicode(value):
    """
    Taking care of special characters as gently as possible
    Args:
        value (string): input string, can contain unicode characters
    Returns:
        :obj:`string` where the unicode characters are replaced with standard
        ASCII counterparts (for example en-dash and em-dash with regular dash,
        apostrophe and quotation variations with the standard ones) or taken
        out if there's no substitute.
    """
    if not value or not isinstance(value, basestring):
        return value
    if isinstance(value, str):
        return value
    return unicodedata.normalize("NFKD", value).encode("ascii", "ignore")
This is Python 2.
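A Python 3 sketch of the same normalization follows (to_ascii is an illustrative name, and the inputs are made up). It is worth checking which characters actually have a decomposition: NFKD helps for accented letters or the ellipsis, while characters without one, such as the en dash or emoji, are simply dropped by the 'ignore' handler rather than replaced.

```python
import unicodedata

def to_ascii(value):
    """Decompose with NFKD, then keep only the ASCII part as a str."""
    decomposed = unicodedata.normalize("NFKD", value)
    return decomposed.encode("ascii", "ignore").decode("ascii")

print(repr(to_ascii("caf\u00e9")))       # letter plus combining accent, accent dropped
print(repr(to_ascii("wait\u2026")))      # ellipsis decomposes to three full stops
print(repr(to_ascii("a\u2013b")))        # en dash has no decomposition, so it is dropped
print(repr(to_ascii("dog \U0001F602")))  # emoji has no decomposition either
```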
I found two libs to replace emojis:
Emoji: https://pypi.org/project/emoji/
import emoji
string = " "
emoji.replace_emoji(string, replace="!")
Demoji: https://pypi.org/project/demoji/
import demoji
string = " "
demoji.replace(string, repl="!")
Both of them have other useful methods.
I also wanted to remove emojis from a text file, but most of the solutions give ranges of Unicode code points to remove, which is not a very robust way to do it. The remove_emoji method is a built-in method provided by the clean-text library in Python. We can use it to clean data that has emojis in it. We need to install the library with pip in order to use it in our programs:
pip install clean-text
We can use the following syntax to use it:
#import clean function
from cleantext import clean
#provide string with emojis
text = "Hello world! "
#print text after removing the emojis from it
print(clean(text, no_emoji=True))
Output:
Hello world!