Creating labels based on partial string matches in Python

Question:

I have an index column whose leading two or three characters indicate a category, and I’d like to create labels from them. Consider the following data:

Index
NDP2207342
OC22178098
SD88730948
OC39002847
PTP9983930
NDP9110876

with a desired output of:

Index Labels
NDP2207342 NCAR
OC22178098 OCAR
SD88730948 SDAR
OC39002847 OCAR
PTP9983930 SWAR
NDP9110876 NCAR

Unfortunately the number of leading characters is not consistent (it can be two or three), but those leading characters do map consistently to the categories of interest.

I was hoping for something as simple as a SQL wildcard search, but unfortunately my attempt didn’t work (which is obvious in retrospect). Here’s what I had:


def labelApply(x):
    '''
    Applying labels for categories
    '''
    if x == 'OC%': return 'OCAR'
    elif x == 'NDP%': return 'NCAR'
    elif x == 'PTP%': return 'SWAR'
    elif x == 'SD%' : return 'SDAR'
    else: return 'Out of Area'
df['labels'] = df['index'].apply(labelApply)

But that didn’t work.

Any thoughts?

Asked By: JLuu

||

Answers:

Try:

import re

# Match runs of digits so they can be stripped, leaving only the letter prefix
pat = re.compile(r"\d+")

def labelApply(x):
    """
    Applying labels for categories
    """
    x = pat.sub("", x)
    if x == "OC":
        return "OCAR"
    elif x == "NDP":
        return "NCAR"
    elif x == "PTP":
        return "SWAR"
    elif x == "SD":
        return "SDAR"
    else:
        return "Out of Area"


df["labels"] = df["Index"].apply(labelApply)
print(df)

Prints:

        Index labels
0  NDP2207342   NCAR
1  OC22178098   OCAR
2  SD88730948   SDAR
3  OC39002847   OCAR
4  PTP9983930   SWAR
5  NDP9110876   NCAR
Answered By: Andrej Kesely
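
For context on why the original attempt fails: in Python, `==` tests exact string equality, and `%` has no wildcard meaning in a comparison, so `x == 'OC%'` is only true for the literal string `'OC%'`. The closest equivalent to SQL's `LIKE 'OC%'` is `str.startswith`. Below is a minimal sketch of that idea in plain Python; the names `PREFIX_LABELS` and `label_by_prefix` are hypothetical and chosen only for illustration, while the prefix/label pairs come from the question.

```python
# Prefix -> label pairs taken from the question
PREFIX_LABELS = {
    "OC": "OCAR",
    "NDP": "NCAR",
    "PTP": "SWAR",
    "SD": "SDAR",
}


def label_by_prefix(value, default="Out of Area"):
    """Return the label of the first prefix that `value` starts with."""
    for prefix, label in PREFIX_LABELS.items():
        if value.startswith(prefix):
            return label
    return default


# Sample values from the question
values = ["NDP2207342", "OC22178098", "SD88730948",
          "OC39002847", "PTP9983930", "NDP9110876"]
labels = [label_by_prefix(v) for v in values]
print(labels)
```

With a DataFrame, the same function could be passed to `apply`, for example `df["labels"] = df["Index"].apply(label_by_prefix)`. Note that this checks prefixes in dictionary order, so if one prefix were itself the start of another, the longer one would need to be listed first.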
Alternatively, extract the leading letters and map them with a dictionary:

# Identify the pairings
replace_dict = {
    "OC": "OCAR",
    "NDP": "NCAR",
    "PTP": "SWAR",
    "SD": "SDAR",
}

# Make the new column
# Anchor at the start and use expand=False so extract returns a Series
df['labels'] = df['index'].str.extract(r"^([A-Z]+)", expand=False)
# Replace Unknown
df.loc[~df['labels'].isin(replace_dict), 'labels'] = 'Out of Area'
# Replace Known
df['labels'] = df['labels'].replace(replace_dict)

print(df)

Output:

        index labels
0  NDP2207342   NCAR
1  OC22178098   OCAR
2  SD88730948   SDAR
3  OC39002847   OCAR
4  PTP9983930   SWAR
5  NDP9110876   NCAR
Answered By: BeRT2me
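
The dictionary approach above can probably be condensed into a single chained expression. The following is a sketch rather than a definitive version: it assumes the column is named `Index` as in the question, that prefixes consist only of uppercase letters, and it adds one hypothetical row (`ZZ12345678`) purely to show the fallback label.

```python
import pandas as pd

# Prefix -> label pairs taken from the question
label_map = {"OC": "OCAR", "NDP": "NCAR", "PTP": "SWAR", "SD": "SDAR"}

df = pd.DataFrame({
    "Index": [
        "NDP2207342", "OC22178098", "SD88730948",
        "OC39002847", "PTP9983930", "NDP9110876",
        "ZZ12345678",  # hypothetical row, not in the question's data
    ]
})

df["labels"] = (
    df["Index"]
    .str.extract(r"^([A-Z]+)", expand=False)  # leading uppercase letters
    .map(label_map)                           # unknown prefixes become NaN
    .fillna("Out of Area")                    # fallback label for NaN
)
print(df)
```

Because `map` leaves unmatched prefixes as NaN, a single `fillna` covers both unknown prefixes and values with no letter prefix at all, which avoids the separate `isin` step.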