Removing emojis and special characters in Python
Question:
I have a dataset called df_bios that looks like this:
{'userid': {0: 1, 1: 2, 2: 3, 3: 4, 4: 5, 5: 6, 6: 7}, 'text_string': {0: 'I live in Miami and work in software', 1: 'Chicago, IL', 2: 'Dog Mom in Cincinnati ', 3: 'Accountant at @EY/Baltimore', 4: 'World traveler but I call Atlanta home', 5: '⚡️ ❤️ sc/-emmabrown1133 @shefit EMMA15 ', 6: 'Working in Orlando. From Korea.'}}
I’m trying to remove all the unnecessary emojis (as well as any other special characters, symbols, pictographs, etc…)
I tried using the answer provided here, but it didn’t do anything:
import re

def remove_emojis(df_bios):
    emoj = re.compile("["
        u"\U0001F600-\U0001F64F"  # emoticons
        u"\U0001F300-\U0001F5FF"  # symbols & pictographs
        u"\U0001F680-\U0001F6FF"  # transport & map symbols
        u"\U0001F1E0-\U0001F1FF"  # flags (iOS)
        u"\U00002500-\U00002BEF"  # chinese char
        u"\U00002702-\U000027B0"
        u"\U00002702-\U000027B0"
        u"\U000024C2-\U0001F251"
        u"\U0001f926-\U0001f937"
        u"\U00010000-\U0010ffff"
        u"\u2640-\u2642"
        u"\u2600-\u2B55"
        u"\u200d"
        u"\u23cf"
        u"\u23e9"
        u"\u231a"
        u"\ufe0f"  # dingbats
        u"\u3030"
        "]+", re.UNICODE)
    return re.sub(emoj, '', df_bios)
It didn’t raise any errors; it just returned the same data unchanged.
Answers:
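Apply your remove_emojis function to the text_string column of the dataframe with apply, so that it runs on each string in turn, and assign the result back to the column. Every run of matching characters is replaced with an empty string.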
import re
import pandas as pd

def remove_emojis(text):
    # Receives one string (a single cell of the column), not the whole dataframe.
    emoj = re.compile("["
        u"\U0001F600-\U0001F64F"  # emoticons
        u"\U0001F300-\U0001F5FF"  # symbols & pictographs
        u"\U0001F680-\U0001F6FF"  # transport & map symbols
        u"\U0001F1E0-\U0001F1FF"  # flags (iOS)
        u"\U00002500-\U00002BEF"  # box drawing through misc symbols and arrows
        u"\U00002702-\U000027B0"  # dingbats
        u"\U000024C2-\U0001F251"
        u"\U0001f926-\U0001f937"
        u"\U00010000-\U0010ffff"
        u"\u2640-\u2642"
        u"\u2600-\u2B55"
        u"\u200d"  # zero-width joiner
        u"\u23cf"
        u"\u23e9"
        u"\u231a"
        u"\ufe0f"  # variation selector-16
        u"\u3030"
        "]+", re.UNICODE)
    return re.sub(emoj, '', text)

data = {'userid': {0: 1, 1: 2, 2: 3, 3: 4, 4: 5, 5: 6, 6: 7}, 'text_string': {0: 'I live in Miami and work in software', 1: 'Chicago, IL', 2: 'Dog Mom in Cincinnati ', 3: 'Accountant at @EY/Baltimore', 4: 'World traveler but I call Atlanta home', 5: '⚡️ ❤️ sc/-emmabrown1133@shefit EMMA15 ', 6: 'Working in Orlando. From Korea.'}}
df_bios = pd.DataFrame(data)

# Run the function on every value in the column and store the cleaned strings.
df_bios['text_string'] = df_bios['text_string'].apply(remove_emojis)
print(df_bios)
Output:
   userid  text_string
0       1  I live in Miami and work in software
1       2  Chicago, IL
2       3  Dog Mom in Cincinnati
3       4  Accountant at @EY/Baltimore
4       5  World traveler but I call Atlanta home
5       6  sc/-emmabrown1133@shefit EMMA15
6       7  Working in Orlando. From Korea.
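The question also asks about "any other special characters", which the emoji ranges alone may not cover. The sketch below shows two possible variants, neither taken from the answer above. The names strip_emojis, strip_special, EMOJI_PATTERN and SPECIAL_PATTERN are hypothetical, the emoji ranges are a trimmed illustration rather than a complete inventory, and the set of punctuation that is kept (. , @ / -) is an assumption chosen to suit the sample bios. Both patterns are compiled once instead of on every call, and both functions collapse the whitespace left behind by removed characters.

```python
import re

# Hypothetical trimmed emoji pattern, compiled once at import time.
EMOJI_PATTERN = re.compile(
    "["
    "\U0001F1E6-\U0001F1FF"  # regional indicator symbols (flags)
    "\U0001F300-\U0001FAFF"  # pictographs, emoticons, transport symbols
    "\u2600-\u27BF"          # miscellaneous symbols and dingbats
    "\u200d"                 # zero-width joiner
    "\ufe0f"                 # variation selector-16
    "]+"
)

# Allow-list alternative (assumption: only these punctuation marks matter).
# Anything that is not a word character, whitespace, or one of . , @ / -
# is treated as "special" and removed.
SPECIAL_PATTERN = re.compile(r"[^\w\s.,@/-]+")


def strip_emojis(text: str) -> str:
    """Remove characters in the listed emoji ranges, then tidy whitespace."""
    cleaned = EMOJI_PATTERN.sub("", text)
    return " ".join(cleaned.split())


def strip_special(text: str) -> str:
    """Keep letters, digits, underscore, whitespace and . , @ / - only."""
    cleaned = SPECIAL_PATTERN.sub("", text)
    return " ".join(cleaned.split())


# The emoji bio from the question, written with explicit escapes.
bio = "\u26a1\ufe0f \u2764\ufe0f sc/-emmabrown1133@shefit EMMA15 "
print(strip_emojis(bio))
print(strip_special("Dog Mom in Cincinnati \U0001F436!"))
```

Either function can be passed to apply in the same way as remove_emojis, for example df_bios['text_string'].apply(strip_special). The allow-list version is stricter: it also drops punctuation such as ! and #, so the kept characters should be adjusted to the data.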