How to remove non-ASCII characters from a list
Question:
I have a DataFrame of object dtype in which some elements are text and some are numbers.
When I convert a column to a list, some of the elements contain non-ASCII characters.
Is there a way to get rid of those characters, like .encode('ascii', 'ignore'), but for a list?
Here is the list that I get:
['Central Park\u202c',
'Top of the Rock',
'Statue of Liberty\u202c',
'Brooklyn Bridge'
]
Answers:
You can use the str accessor on the column before converting it to a list:
df.my_column.str.encode('ascii','ignore').str.decode('ascii').tolist()
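One caveat, since the question mentions that the column mixes text and numbers: as far as I can tell, the .str methods return NaN for elements that are not strings, so the numbers would not survive into the resulting list. Below is a minimal sketch that leaves non-strings untouched. The DataFrame contents and the helper name to_ascii are made up for illustration; only the column name my_column comes from the answer above.

```python
import pandas as pd

# Hypothetical sample data: strings ending in U+202C mixed with plain numbers
df = pd.DataFrame(
    {"my_column": ["Central Park\u202c", 42, "Statue of Liberty\u202c", 3.5]}
)

def to_ascii(value):
    # Only strings are cleaned; numbers and other objects pass through unchanged
    if isinstance(value, str):
        return value.encode("ascii", "ignore").decode("ascii")
    return value

cleaned = df["my_column"].map(to_ascii).tolist()
print(cleaned)
```

Series.map applies the function element by element, so this is slower than the vectorized .str chain on large columns, but it keeps every element in place.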
If you want to post-process your list instead, you can apply encode('ascii', 'ignore') to each element:
my_list = [
    'Central Park\u202c',
    'Top of the Rock',
    'Statue of Liberty\u202c',
    'Brooklyn Bridge'
]
# Encode to ASCII, dropping anything that cannot be represented, then decode back to str
my_list = [e.encode('ascii', 'ignore').decode('ascii') for e in my_list]
print(my_list)
And the output should be:
['Central Park', 'Top of the Rock', 'Statue of Liberty', 'Brooklyn Bridge']
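Keep in mind that 'ignore' drops every character outside the ASCII range, not only invisible ones like U+202C, so accented letters disappear as well. The sketch below wraps the same technique in a small helper that also skips non-string items, which matters if the list still contains numbers. The function name strip_non_ascii and the sample values are hypothetical.

```python
def strip_non_ascii(items):
    """Return a new list with non-ASCII characters removed from each string."""
    return [
        # Leave numbers and other non-string items as they are
        item.encode("ascii", "ignore").decode("ascii") if isinstance(item, str) else item
        for item in items
    ]

sample = ["Central Park\u202c", 7, "Caf\u00e9 Lalo"]
result = strip_non_ascii(sample)
print(result)  # ['Central Park', 7, 'Caf Lalo']
```

If losing letters such as the "é" is not acceptable, the list needs a more targeted cleanup than a blanket ASCII encode.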