How to extract only the letters from a string mixed with numbers in Python
Question:
I have this table in my dataframe. The char column contains values that are letters only, numbers only, or a combination of letters and numbers.
char     count
123      24
test     25
te123    26
test123  26
I want to extract only the letters, and if a row contains numbers only, I want to make it blank.
The expected results would be:
char  count
NaN   24
test  25
te    26
test  26
How can I do this in Python?
Thank you in advance
Answers:
You can use regex to do this.
import pandas as pd
import numpy as np
import re
data = {'char': ['123', 'test', 'te123', 'test123'], 'count': [24, 25, 26, 26]}
df = pd.DataFrame(data)
df['char'] = df['char'].apply(lambda x: re.sub('[^a-zA-Z]+', '', x) if bool(re.search('[a-zA-Z]', x)) else np.nan)
print(df)
Here re.sub('[^a-zA-Z]+', '', x) removes all non-letter characters from the string, and bool(re.search('[a-zA-Z]', x)) checks whether the original string contains at least one letter. If it does not, the value becomes NaN.
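The same logic can also be written as a small named function, which may be easier to test than a lambda. This is only a sketch: the name letters_only is made up for illustration, and it assumes every value in the column is a string.

```python
import re

import numpy as np
import pandas as pd


def letters_only(value):
    """Hypothetical helper: keep ASCII letters, return NaN if none remain."""
    stripped = re.sub(r"[^a-zA-Z]+", "", value)
    # An empty result means the original value contained no letters.
    return stripped if stripped else np.nan


df = pd.DataFrame({"char": ["123", "test", "te123", "test123"],
                   "count": [24, 25, 26, 26]})
df["char"] = df["char"].apply(letters_only)
print(df)
```

Because the substitution runs first and the emptiness check second, each value is scanned by only one regex.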
You can use extract:
df["char"] = df["char"].str.extract("([a-zA-Z]+)", expand=False)
If letters and digits are interleaved, as in "te12s3t", use findall:
df["char"] = df["char"].str.findall("([a-zA-Z]+)").str.join("")
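To see how extract and findall differ, here is a small side-by-side comparison. The sample series and the variable names first_run and all_runs are assumptions made for this illustration.

```python
import pandas as pd

s = pd.Series(["123", "te123", "te12s3t"])

# extract returns only the first run of letters (NaN when there is none).
first_run = s.str.extract(r"([a-zA-Z]+)", expand=False)

# findall collects every run of letters; join glues them back together.
all_runs = s.str.findall(r"[a-zA-Z]+").str.join("")

print(pd.DataFrame({"original": s,
                    "extract": first_run,
                    "findall_join": all_runs}))
```

Note that with findall and join, a digits-only value such as "123" ends up as an empty string rather than NaN, so it may need an extra step if NaN is required.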
Or simply use replace to handle both cases:
df["char"] = df["char"].replace(r"\d+", "", regex=True).mask(lambda s: s.eq(""))
Or, in a @Corralien way, use isdigit combined with replace:
df["char"] = df["char"].mask(df["char"].str.isdigit()).str.replace(r"\d+", "", regex=True)
Output:
print(df)
   char  count
0   NaN     24
1  test     25
2    te     26
3  test     26
We can use str.replace as follows:
df["char"] = df["char"].str.replace(r'\d+', '', regex=True)
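On its own, str.replace strips the digits but leaves an empty string in rows that held only digits. A minimal sketch of one way to turn those into NaN, so the result matches the expected output in the question (the intermediate name stripped is an assumption for this example):

```python
import pandas as pd

df = pd.DataFrame({"char": ["123", "test", "te123", "test123"],
                   "count": [24, 25, 26, 26]})

# Strip digits; the digits-only row becomes an empty string here.
stripped = df["char"].str.replace(r"\d+", "", regex=True)

# Replace empty strings with NaN to match the expected output.
df["char"] = stripped.mask(stripped.eq(""))
print(df)
```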