Creating labels based on partial string matches in Python
Question:
I have an index column whose leading 2 or 3 characters indicate a category I’d like to create labels for. Consider the following data:
Index |
---|
NDP2207342 |
OC22178098 |
SD88730948 |
OC39002847 |
PTP9983930 |
NDP9110876 |
with a desired output of:
Index | Labels |
---|---|
NDP2207342 | NCAR |
OC22178098 | OCAR |
SD88730948 | SDAR |
OC39002847 | OCAR |
PTP9983930 | SWAR |
NDP9110876 | NCAR |
Unfortunately the number of leading characters is not consistent (could be two or three), but those leading characters do consistently map to the categories of interest.
I was hoping for something as simple as (or similar to) a SQL wildcard search, but my attempt didn’t work, which is obvious in retrospect. Here’s what I had:
def labelApply(x):
    '''
    Applying labels for categories
    '''
    if x == 'OC%': return 'OCAR'
    elif x == 'NDP%': return 'NCAR'
    elif x == 'PTP%': return 'SWAR'
    elif x == 'SD%' : return 'SDAR'
    else: return 'Out of Area'

df['labels'] = df['index'].apply(labelApply)
But that didn’t work.
Any thoughts?
Answers:
Try:
import re

# Matches one or more digits; removing them leaves only the letter prefix.
pat = re.compile(r"\d+")


def labelApply(x):
    """
    Applying labels for categories
    """
    x = pat.sub("", x)
    if x == "OC":
        return "OCAR"
    elif x == "NDP":
        return "NCAR"
    elif x == "PTP":
        return "SWAR"
    elif x == "SD":
        return "SDAR"
    else:
        return "Out of Area"


df["labels"] = df["Index"].apply(labelApply)
print(df)
Prints:
        Index labels
0  NDP2207342   NCAR
1  OC22178098   OCAR
2  SD88730948   SDAR
3  OC39002847   OCAR
4  PTP9983930   SWAR
5  NDP9110876   NCAR
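For something closer to the SQL `LIKE 'OC%'` idea from the question, `str.startswith` is the usual Python counterpart. Below is a minimal sketch, not the answer's method: `PREFIX_LABELS` and `label_by_prefix` are hypothetical names, and the DataFrame is rebuilt from the question's sample data with the column assumed to be called `Index`.

```python
import pandas as pd

# Prefix-to-label pairs taken from the question.
# PREFIX_LABELS is a hypothetical name used only for this sketch.
PREFIX_LABELS = {
    "OC": "OCAR",
    "NDP": "NCAR",
    "PTP": "SWAR",
    "SD": "SDAR",
}


def label_by_prefix(value, default="Out of Area"):
    """Return the label of the first known prefix that `value` starts with."""
    # Try longer prefixes first so a short prefix cannot shadow a longer one.
    for prefix in sorted(PREFIX_LABELS, key=len, reverse=True):
        if value.startswith(prefix):
            return PREFIX_LABELS[prefix]
    return default


# Sample data copied from the question; the column name "Index" is an assumption.
df = pd.DataFrame(
    {
        "Index": [
            "NDP2207342",
            "OC22178098",
            "SD88730948",
            "OC39002847",
            "PTP9983930",
            "NDP9110876",
        ]
    }
)

df["labels"] = df["Index"].apply(label_by_prefix)
print(df)
```

Unlike stripping digits, this does not depend on what follows the prefix, so it should still work if the remainder of the value contains letters. Values with no known prefix fall back to `Out of Area`.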
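Another option is to extract the leading letters and replace them using a dictionary:
# Identify the pairings
replace_dict = {
    "OC": "OCAR",
    "NDP": "NCAR",
    "PTP": "SWAR",
    "SD": "SDAR",
}
# Make the new column from the leading uppercase letters
# (expand=False returns a Series rather than a one-column DataFrame)
df['labels'] = df['index'].str.extract(r"^([A-Z]+)", expand=False)
# Replace Unknown
df.loc[~df['labels'].isin(replace_dict), 'labels'] = 'Out of Area'
# Replace Known
df['labels'] = df['labels'].replace(replace_dict)
print(df)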
Output:
        index labels
0  NDP2207342   NCAR
1  OC22178098   OCAR
2  SD88730948   SDAR
3  OC39002847   OCAR
4  PTP9983930   SWAR
5  NDP9110876   NCAR
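The extract-then-replace steps can probably be collapsed into a single chain with `Series.map` and `fillna`. This is a sketch under assumptions rather than a drop-in replacement: the DataFrame is rebuilt from the question's sample data, the column is assumed to be named `index`, and the prefix is assumed to be made up of leading uppercase letters only.

```python
import pandas as pd

# Sample data copied from the question; the column name "index" is an assumption.
df = pd.DataFrame(
    {
        "index": [
            "NDP2207342",
            "OC22178098",
            "SD88730948",
            "OC39002847",
            "PTP9983930",
            "NDP9110876",
        ]
    }
)

# Same pairings as in the answer above.
replace_dict = {
    "OC": "OCAR",
    "NDP": "NCAR",
    "PTP": "SWAR",
    "SD": "SDAR",
}

# Pull the leading run of uppercase letters; expand=False returns a Series.
prefix = df["index"].str.extract(r"^([A-Z]+)", expand=False)

# map() gives NaN for any prefix missing from the dict (or when nothing matched),
# and fillna() turns those into the fallback label.
df["labels"] = prefix.map(replace_dict).fillna("Out of Area")
print(df)
```

With `map`, unknown prefixes need no separate `isin` step, because anything not found in the dictionary becomes `NaN` and is then filled with the fallback label.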