Replace by NaN if string contains digits or symbols
Question:
I have a dataframe and I need to identify values that contain numbers or symbols in order to eliminate them. Only letters and spaces are allowed. The size of the dataframe is quite big and what I am trying doesn’t result in anything:
df.NAME=df.NAME.replace(r"(/^[a-zA-Z\s]*$/)",np.nan,regex=True)
Any suggestions?
Thank you
Answers:
If you need to keep only the items that consist of letters and spaces, you need a solution based on Series.str.contains, not replace:
df['NAME'] = df['NAME'].where(df['NAME'].str.contains(r"^[a-zA-Z\s]*$", regex=True))
That will keep only those values in the NAME column that consist solely of ASCII letters and/or whitespace.
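As a rough, self-contained sketch of the mask-based approach (the sample frame is made up for illustration, and na=False is my own addition so that missing values count as non-matching), you can either replace the offending values with NaN or drop those rows:

```python
import pandas as pd

# Hypothetical sample data; only the NAME column comes from the question.
df = pd.DataFrame({"NAME": ["John Smith", "R2D2", "Anna-Maria", "Mary Ann", None]})

# True only where the whole string consists of ASCII letters and whitespace.
# na=False (an assumption, not in the original answer) keeps the mask boolean
# when the column already holds missing values.
mask = df["NAME"].str.contains(r"^[a-zA-Z\s]*$", regex=True, na=False)

# Option 1: keep every row, but replace non-matching values with NaN.
cleaned = df["NAME"].where(mask)

# Option 2: drop the non-matching rows entirely.
kept = df[mask]
```

Note that an empty string also passes this check, because * allows zero characters; use + instead of * if empty names should be rejected too.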
To support any Unicode letters, you’d need
df['NAME'] = df['NAME'].where(df['NAME'].str.contains(r"^(?:[^\W\d_]|\s)*$", regex=True))
where (?:[^\W\d_]|\s) matches either any Unicode letter (together with most diacritics) or a whitespace char.
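To see the difference between the two patterns, here is a small sketch on made-up names (the sample values and variable names are hypothetical). The trick is that [^\W\d_] means "a word character that is neither a digit nor an underscore", which leaves only letters:

```python
import pandas as pd

# Hypothetical sample data mixing ASCII and non-ASCII names.
df = pd.DataFrame({"NAME": ["Łukasz", "Иван Петров", "Zoe 2", "user_name", "Anna"]})

ascii_only = r"^[a-zA-Z\s]*$"
any_letter = r"^(?:[^\W\d_]|\s)*$"

# na=False is an assumption added here so missing values never match.
ascii_mask = df["NAME"].str.contains(ascii_only, regex=True, na=False)
unicode_mask = df["NAME"].str.contains(any_letter, regex=True, na=False)

# Values that fail the Unicode-aware check become NaN.
df["NAME_CLEAN"] = df["NAME"].where(unicode_mask)
```

Here the ASCII pattern accepts only "Anna", while the Unicode-aware one also accepts "Łukasz" and "Иван Петров"; "Zoe 2" and "user_name" are rejected by both.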