Pandas find multiple words from a list and assign Boolean value if found

Question:

So, I have a dataframe like this:

import pandas as pd

data = {
  "properties": ["FinancialOffice","Gas Station", "Office", "K-12 School", "Commercial, Office"],
}
df = pd.DataFrame(data)

This is my list:

proplist = ["Office","Other - Mall","Gym"]

Using the list, I am trying to find which values in the dataframe column exactly match an entry in the list, and to assign each row a Boolean true/false (or 0/1) flag. It has to be an exact match.

The expected output looks like this:

properties         flag
FinancialOffice    FALSE
Gas Station        FALSE
Office             TRUE
K-12 School        FALSE
Commercial, Office TRUE

So it returns TRUE for "Office" because it is an exact match for an entry in the list. FinancialOffice is FALSE because it is not in the list. The last row, Commercial, Office, is TRUE because Office is in the list even though Commercial is not. So if even one of the comma-separated values is present, the flag should be TRUE.

df["flag"] = df["properties"].isin(proplist)

The code above assigns a Boolean true/false, but it returns FALSE for the last row (Commercial, Office) because it compares the whole string against the list.

Any help is appreciated.

Asked By: Tahsin Alam

||

Answers:

Use a crafted regex with word boundaries (\b):

import re

regex = r'\b(?:%s)\b' % '|'.join(map(re.escape, proplist))
# '\b(?:Office|Other\ \-\ Mall|Gym)\b'

df['flag'] = df['properties'].str.contains(regex, regex=True)
# for a case insensitive match add the case=False parameter

output:

           properties   flag
0     FinancialOffice  False
1         Gas Station  False
2              Office   True
3         K-12 School  False
4  Commercial, Office   True
Answered By: mozway
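
For reference, here is a self-contained sketch of the regex approach above, run against the question's data. One caveat worth checking against your own data: \b only marks word boundaries, so a value such as "Office Park" (a made-up example, not from the question) would also be flagged, because "Office" appears in it as a whole word.

```python
import re

import pandas as pd

df = pd.DataFrame({
    "properties": ["FinancialOffice", "Gas Station", "Office",
                   "K-12 School", "Commercial, Office"],
})
proplist = ["Office", "Other - Mall", "Gym"]

# Build one alternation from the list; re.escape guards characters like "-".
pattern = r"\b(?:%s)\b" % "|".join(map(re.escape, proplist))

df["flag"] = df["properties"].str.contains(pattern, regex=True)
print(df)

# Hypothetical extra value (not in the question): a word-boundary match
# flags it even though the whole entry "Office Park" is not in proplist.
extra = pd.Series(["Office Park"])
extra_flag = extra.str.contains(pattern, regex=True)
print(extra_flag.tolist())
```

If entries like that should not count as matches, splitting on the commas and comparing whole entries is probably the safer choice.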

You can define a separate function to do the check, for example:

import pandas as pd

data = {"properties": ["FinancialOffice", "Gas Station", "Office", "K-12 School", "Commercial, Office"]}
df = pd.DataFrame(data)
proplist = ["Office", "Other - Mall", "Gym"]

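def check_present(cell):
    # Split the cell on commas and compare each trimmed entry with proplist.
    for word in (e.strip() for e in cell.split(',')):
        if word in proplist:
            return 'TRUE'
    return 'FALSE'

# Pass the function directly; no lambda wrapper is needed.
df['flag'] = df['properties'].apply(check_present)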
print(df)

Output:

           properties   flag
0     FinancialOffice  FALSE
1         Gas Station  FALSE
2              Office   TRUE
3         K-12 School  FALSE
4  Commercial, Office   TRUE
Answered By: perpetualstudent
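
Note that the function above returns the strings 'TRUE' and 'FALSE', which are both truthy in Python. If real Booleans or 0/1 are wanted, as the question mentions, a variant along these lines might work. This is only a sketch: the names has_listed_property, allowed and flag_int are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "properties": ["FinancialOffice", "Gas Station", "Office",
                   "K-12 School", "Commercial, Office"],
})
proplist = ["Office", "Other - Mall", "Gym"]

# A set gives constant-time membership checks (hypothetical helper name).
allowed = set(proplist)


def has_listed_property(cell):
    """Return True if any comma-separated entry of cell is in allowed."""
    return any(part.strip() in allowed for part in cell.split(","))


# Real Booleans instead of the strings 'TRUE' / 'FALSE'.
df["flag"] = df["properties"].apply(has_listed_property)

# Optional 0/1 version of the same flag.
df["flag_int"] = df["flag"].astype(int)
print(df)
```

With a Boolean column, filtering such as df[df["flag"]] works directly, which it would not with string values.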

You can use split() and strip() to turn each comma-delimited properties string into a list of strings, then use Python's set intersection operator & to test whether any of them appear in proplist:

propset = set(proplist)
df['flag'] = (
    df.properties.str.split(',')
    .apply(lambda x: len({s.strip() for s in x} & propset) > 0))

Output:

           properties   flag
0     FinancialOffice  False
1         Gas Station  False
2              Office   True
3         K-12 School  False
4  Commercial, Office   True
Answered By: constantstranger
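
The same split-and-compare idea can also be written without apply. The sketch below is an alternative, not taken from any of the answers: it uses Series.explode() to put each comma-separated entry on its own row, checks each entry with isin(), and then collapses back to one value per original row with groupby(level=0).any(). The name tokens is made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "properties": ["FinancialOffice", "Gas Station", "Office",
                   "K-12 School", "Commercial, Office"],
})
proplist = ["Office", "Other - Mall", "Gym"]

# One row per comma-separated entry; explode() keeps the original index,
# so "Commercial, Office" becomes two rows that both carry index 4.
tokens = df["properties"].str.split(",").explode().str.strip()

# Exact membership test per entry, then "any entry matched" per original row.
df["flag"] = tokens.isin(proplist).groupby(level=0).any()
print(df)
```

Whether this is faster than apply depends on the data, so it is worth timing on a realistic sample before switching.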