Creating labels based on partial string matches in Python

Question

I have an index column whose leading 2 or 3 characters indicate a category I’d like to create labels for. Consider the following data:

Index
NDP2207342
OC22178098
SD88730948
OC39002847
PTP9983930
NDP9110876

with a desired output of:

Index	Labels
NDP2207342	NCAR
OC22178098	OCAR
SD88730948	SDAR
OC39002847	OCAR
PTP9983930	SWAR
NDP9110876	NCAR

Unfortunately the number of leading characters is not consistent (could be two or three), but those leading characters do consistently map to the categories of interest.

I was hoping for something as simple as a SQL wildcard search, but that didn’t work (which is obvious in retrospect). Here’s what I had:


def labelApply(x):
    '''
    Applying labels for categories
    '''
    if x == 'OC%': return 'OCAR'
    elif x == 'NDP%': return 'NCAR'
    elif x == 'PTP%': return 'SWAR'
    elif x == 'SD%' : return 'SDAR'
    else: return 'Out of Area'
df['labels'] = df['index'].apply(labelApply)

But that didn’t work.

Any thoughts?

Asked By: JLuu


Answer 1

Try:

import re

pat = re.compile(r"\d+")  # one or more digits; stripping them leaves only the letter prefix

def labelApply(x):
    """
    Applying labels for categories
    """
    x = pat.sub("", x)
    if x == "OC":
        return "OCAR"
    elif x == "NDP":
        return "NCAR"
    elif x == "PTP":
        return "SWAR"
    elif x == "SD":
        return "SDAR"
    else:
        return "Out of Area"


df["labels"] = df["Index"].apply(labelApply)
print(df)

Prints:

        Index labels
0  NDP2207342   NCAR
1  OC22178098   OCAR
2  SD88730948   SDAR
3  OC39002847   OCAR
4  PTP9983930   SWAR
5  NDP9110876   NCAR

Answered By: Andrej Kesely
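
The attempt in the question fails because `==` compares whole strings literally, and `%` has no wildcard meaning in Python. The closest built-in to SQL's `LIKE 'OC%'` is probably `str.startswith`. Below is a minimal sketch of that idea in plain Python; the names `PREFIX_LABELS` and `label_for` are illustrative and do not come from the question or the answer above.

```python
# Prefix-to-label pairs taken from the question; the names here are illustrative.
PREFIX_LABELS = {
    "OC": "OCAR",
    "NDP": "NCAR",
    "PTP": "SWAR",
    "SD": "SDAR",
}


def label_for(value, default="Out of Area"):
    """Return the label of the first prefix that `value` starts with, else `default`."""
    for prefix, label in PREFIX_LABELS.items():
        if value.startswith(prefix):
            return label
    return default


# The last entry is a made-up value with no known prefix.
sample = ["NDP2207342", "OC22178098", "SD88730948", "PTP9983930", "XX12345678"]
labels = [label_for(v) for v in sample]
print(labels)
```

With a DataFrame, the same function could be applied as `df["labels"] = df["Index"].apply(label_for)`. If one prefix could ever be the start of another (for example a hypothetical `OC` and `OCX`), the dictionary would need to list the longer prefix first, since the loop returns on the first match.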

Answer 2

# Identify the pairings
replace_dict = {
    "OC": "OCAR",
    "NDP": "NCAR",
    "PTP": "SWAR",
    "SD": "SDAR",
}

# Make the new column
df['labels'] = df['index'].str.extract("([A-Z]+)")
# Replace Unknown
df.loc[~df['labels'].isin(replace_dict), 'labels'] = 'Out of Area'
# Replace Known
df['labels'] = df['labels'].replace(replace_dict)

print(df)

Output:

        index labels
0  NDP2207342   NCAR
1  OC22178098   OCAR
2  SD88730948   SDAR
3  OC39002847   OCAR
4  PTP9983930   SWAR
5  NDP9110876   NCAR

Answered By: BeRT2me
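
The extract-then-replace idea above can also be sketched without pandas, which may help when checking the mapping logic in isolation. This assumes the category prefix is always a run of uppercase letters at the very start of the value; `LEADING_LETTERS` and `label_from_prefix` are hypothetical names introduced for illustration.

```python
import re

# Same pairings as in the answer above.
replace_dict = {
    "OC": "OCAR",
    "NDP": "NCAR",
    "PTP": "SWAR",
    "SD": "SDAR",
}

# Assumption: the prefix is a run of uppercase letters at the start of the value.
LEADING_LETTERS = re.compile(r"[A-Z]+")


def label_from_prefix(value, default="Out of Area"):
    """Look up the leading letters of `value` in replace_dict, falling back to `default`."""
    match = LEADING_LETTERS.match(value)  # match() only matches at the start of the string
    prefix = match.group(0) if match else ""
    return replace_dict.get(prefix, default)


# The last two entries are made-up values: an unknown prefix and no prefix at all.
values = ["NDP2207342", "OC39002847", "ZZ00000000", "12345"]
labels = [label_from_prefix(v) for v in values]
print(labels)
```

Here `dict.get` with a default handles known and unknown prefixes in a single lookup, instead of the separate "Replace Unknown" and "Replace Known" steps. On a DataFrame it could be used as `df["labels"] = df["index"].map(label_from_prefix)`.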
