Replace by NaN if string contains digits or symbols

Question:

I have a dataframe and need to identify values that contain digits or symbols so that I can eliminate them. Only letters and spaces are allowed. The dataframe is quite large, and what I am trying has no effect:

df.NAME=df.NAME.replace(r"(/^[a-zA-Z\s]*$/)",np.nan,regex=True)

Any suggestions?
Thank you

Asked By: Sapehi

||

Answers:

If you need to keep only the items that consist of letters and spaces, you need a solution based on Series.str.contains, not replace:

df['NAME'] = df['NAME'].where(df['NAME'].str.contains(r"^[a-zA-Z\s]*$", regex=True))

That will keep all the items in the NAME column that contain only ASCII letters and/or whitespace.
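
As a minimal sketch of this on made-up data (the sample names are hypothetical and not from the question), the boolean mask from str.contains can be passed to Series.where so that non-matching values become NaN, which is what the question title asks for. The na=False argument is an addition here so that values that are already missing are treated as non-matching:

```python
import pandas as pd

# Hypothetical sample data, for illustration only
df = pd.DataFrame({"NAME": ["John Smith", "R2D2", "Anna-Maria", "Mary Ann", "x@y"]})

# True where the whole value consists of ASCII letters and whitespace only
mask = df["NAME"].str.contains(r"^[a-zA-Z\s]*$", regex=True, na=False)

# Keep the matching values; everything else is replaced with NaN
df["NAME"] = df["NAME"].where(mask)

print(df)
```

If the goal is to drop the offending rows instead of blanking them, df = df[mask] with the same mask should do that.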

To support any Unicode letters, you’d need

df['NAME'] = df['NAME'].where(df['NAME'].str.contains(r"^(?:[^\W\d_]|\s)*$", regex=True))

where (?:[^\W\d_]|\s) matches either any Unicode letter (together with most diacritics) or a whitespace char.
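
To see the difference between the two patterns, here is a small sketch on made-up names (hypothetical data). It assumes an object-dtype Series, where matching goes through Python's re module and \w is Unicode-aware for str patterns; other string backends may treat \w differently:

```python
import pandas as pd

# Hypothetical sample data; object dtype so that Python's re module does the matching
names = pd.Series(
    ["José Álvarez", "Stribiżew", "Zoë 2", "snake_case", "Mary Ann"],
    dtype=object,
)

# [^\W\d_] means "a word character that is neither a digit nor an underscore", i.e. a letter
unicode_mask = names.str.contains(r"^(?:[^\W\d_]|\s)*$", regex=True, na=False)

# ASCII-only variant, for comparison
ascii_mask = names.str.contains(r"^[a-zA-Z\s]*$", regex=True, na=False)

print(pd.DataFrame({"name": names, "unicode": unicode_mask, "ascii": ascii_mask}))
```

The accented names pass only the Unicode pattern, while values containing a digit or an underscore fail both.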

Answered By: Wiktor Stribiżew