How to remove accents from values in columns?
Question:
How do I change the special characters to the usual alphabet letters?
This is my dataframe:
In [56]: cities
Out[56]:
Table Code Country Year City Value
240 Åland Islands 2014.0 MARIEHAMN 11437.0 1
240 Åland Islands 2010.0 MARIEHAMN 5829.5 1
240 Albania 2011.0 Durrës 113249.0
240 Albania 2011.0 TIRANA 418495.0
240 Albania 2011.0 Durrës 56511.0
I want it to look like this:
In [56]: cities
Out[56]:
Table Code Country Year City Value
240 Aland Islands 2014.0 MARIEHAMN 11437.0 1
240 Aland Islands 2010.0 MARIEHAMN 5829.5 1
240 Albania 2011.0 Durres 113249.0
240 Albania 2011.0 TIRANA 418495.0
240 Albania 2011.0 Durres 56511.0
Answers:
This is for Python 2.7. For converting to ASCII you might want to try:
import unicodedata
unicodedata.normalize('NFKD', u"Durrës Åland Islands").encode('ascii','ignore')
'Durres Aland Islands'
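On Python 3 the same call returns a bytes object rather than a str, so a final decode is needed. A minimal sketch, where the helper name strip_accents is only illustrative:

```python
import unicodedata

def strip_accents(text):
    """Return text with combining accents removed (illustrative helper)."""
    # NFKD splits an accented letter into a base letter plus combining marks
    decomposed = unicodedata.normalize('NFKD', text)
    # Encoding to ASCII with errors='ignore' drops the combining marks;
    # decoding turns the resulting bytes back into a str
    return decomposed.encode('ascii', 'ignore').decode('ascii')

print(strip_accents("Durrës Åland Islands"))
```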
The pandas method is to use the vectorised str.normalize combined with str.decode and str.encode:
In [60]:
df['Country'].str.normalize('NFKD').str.encode('ascii', errors='ignore').str.decode('utf-8')
Out[60]:
0 Aland Islands
1 Aland Islands
2 Albania
3 Albania
4 Albania
Name: Country, dtype: object
So to do this for all str dtypes:
In [64]:
# np.object was removed in NumPy 1.24; the string 'object' selects the same columns
cols = df.select_dtypes(include=['object']).columns
df[cols] = df[cols].apply(lambda x: x.str.normalize('NFKD').str.encode('ascii', errors='ignore').str.decode('utf-8'))
df
Out[64]:
Table Code Country Year City Value
0 240 Aland Islands 2014.0 MARIEHAMN 11437.0 1
1 240 Aland Islands 2010.0 MARIEHAMN 5829.5 1
2 240 Albania 2011.0 Durres 113249.0
3 240 Albania 2011.0 TIRANA 418495.0
4 240 Albania 2011.0 Durres 56511.0
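For anyone who wants to try this end to end on Python 3, here is a self-contained sketch. The DataFrame is a reduced, made-up stand-in for the one in the question, and ascii_fold is a hypothetical helper name. Text columns are picked with pd.api.types.is_string_dtype instead of select_dtypes, on the assumption that this is less sensitive to how a given pandas version stores strings.

```python
import pandas as pd

# Reduced stand-in for the question's data (only a few columns are kept)
df = pd.DataFrame({
    "Country": ["Åland Islands", "Albania", "Albania"],
    "Year": [2014.0, 2011.0, 2011.0],
    "City": ["MARIEHAMN", "Durrës", "TIRANA"],
})

def ascii_fold(col):
    """Strip accents from one string column (hypothetical helper)."""
    return (col.str.normalize("NFKD")
               .str.encode("ascii", errors="ignore")
               .str.decode("ascii"))

# Keep only the columns that hold text; the numeric Year column is skipped
text_cols = [c for c in df.columns if pd.api.types.is_string_dtype(df[c])]
df[text_cols] = df[text_cols].apply(ascii_fold)

print(df)
```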
Use this code:
df['Country'] = df['Country'].str.replace(u"Å", "A")
df['City'] = df['City'].str.replace(u"ë", "e")
Of course, you then have to repeat this for every special character and every column.
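Rather than chaining one str.replace call per character, the replacements can be collected into a single translation table and applied with Series.str.translate. A sketch, assuming the mapping below, which is illustrative and covers only the two characters from the question:

```python
import pandas as pd

# Illustrative mapping; extend it with whatever characters your data contains
table = str.maketrans({"Å": "A", "ë": "e"})

df = pd.DataFrame({
    "Country": ["Åland Islands", "Albania"],
    "City": ["MARIEHAMN", "Durrës"],
})

# Apply the same table to every column that should be cleaned
for col in ["Country", "City"]:
    df[col] = df[col].str.translate(table)

print(df)
```

This keeps the per-character approach explicit, which may be preferable when only a known handful of characters needs mapping.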
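An example with a pandas Series, using the unidecode package:
import unidecode

def remove_accents(a):
    # a.decode('utf-8') assumes Python 2 byte strings; on Python 3, pass a directly
    return unidecode.unidecode(a.decode('utf-8'))

df['column'] = df['column'].apply(remove_accents)
In this case the values are decoded from UTF-8 and then transliterated to ASCII.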
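A transliteration library can be useful because the NFKD approach only removes accents that Unicode treats as separate combining marks. Letters with no decomposition, such as ø, ß or Ł, are dropped entirely by encode('ascii', 'ignore'). A small standard-library sketch of that behaviour (the sample words are illustrative, not from the question, and nfkd_ascii is a hypothetical helper name):

```python
import unicodedata

def nfkd_ascii(text):
    # Same normalize/encode/decode chain as the pandas answers, for one string
    return (unicodedata.normalize("NFKD", text)
            .encode("ascii", "ignore")
            .decode("ascii"))

for word in ["Durrës", "Tromsø", "Straße", "Łódź"]:
    print(word, "->", nfkd_ascii(word))
# Durrës -> Durres
# Tromsø -> Troms
# Straße -> Strae
# Łódź -> odz
```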
I wanted to remove all the accents from the column names, so I used:
df.columns = df.columns.str.normalize('NFKD').str.encode('ascii',errors='ignore').str.decode('utf-8')
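This works because df.columns is an Index, which exposes the same .str accessor as a Series. A self-contained sketch with made-up column names:

```python
import pandas as pd

# Made-up column names with accents, for illustration only
df = pd.DataFrame({"País": ["Albania"], "Población": [418495]})

# Normalize, drop the combining marks, and turn the bytes back into text
df.columns = (df.columns.str.normalize("NFKD")
                        .str.encode("ascii", errors="ignore")
                        .str.decode("ascii"))

print(list(df.columns))
```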