Pandas find multiple words from a list and assign Boolean value if found
Question:
I have a DataFrame like this:
import pandas as pd
data = {
    "properties": ["FinancialOffice", "Gas Station", "Office", "K-12 School", "Commercial, Office"],
}
df = pd.DataFrame(data)
This is my list:
proplist = ["Office", "Other - Mall", "Gym"]
Using this list, I want to check which values in the DataFrame column exactly match an entry in the list, and assign each row a Boolean flag (True/False, or 0/1). It has to be an exact match.
The expected output looks like this:
properties          flag
FinancialOffice     FALSE
Gas Station         FALSE
Office              TRUE
K-12 School         FALSE
Commercial, Office  TRUE
"Office" should be TRUE because it exactly matches an entry in the list. "FinancialOffice" should be FALSE because it is not in the list. The last row, "Commercial, Office", should be TRUE because "Office" is in the list even though "Commercial" is not: if at least one of the comma-separated values is present, the flag is TRUE.
df["flag"] = df["properties"].isin(proplist)
The code above assigns a Boolean correctly for single values, but it returns FALSE for the last row ("Commercial, Office") because isin compares the whole string against the list.
Any help is appreciated.
Answers:
Build a regex from the list and wrap it in word boundaries (\b):
import re
regex = r'\b(?:%s)\b' % '|'.join(map(re.escape, proplist))
# '\b(?:Office|Other\ \-\ Mall|Gym)\b'
df['flag'] = df['properties'].str.contains(regex, regex=True)
# for a case insensitive match add the case=False parameter
output:
properties flag
0 FinancialOffice False
1 Gas Station False
2 Office True
3 K-12 School False
4 Commercial, Office True
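One caveat with the word-boundary pattern: \b matches any whole word, so a value such as "Home Office" would also be flagged. The sketch below compares it with a stricter pattern that requires a full comma-separated item to equal a list entry. The value "Home Office" and the names `strict`, `word_match` and `item_match` are made up for illustration and are not part of the original data.

```python
import re
import pandas as pd

proplist = ["Office", "Other - Mall", "Gym"]

# "Home Office" is a hypothetical value added to show the difference
s = pd.Series(["Commercial, Office", "Home Office", "FinancialOffice"])

alternatives = "|".join(map(re.escape, proplist))

# Word-boundary version: matches a list entry anywhere as a whole word
regex = r"\b(?:%s)\b" % alternatives
word_match = s.str.contains(regex, regex=True)

# Stricter version: the entry must fill a whole comma-separated item
strict = r"(?:^|,)\s*(?:%s)\s*(?:,|$)" % alternatives
item_match = s.str.contains(strict, regex=True)

print(word_match.tolist())  # [True, True, False]
print(item_match.tolist())  # [True, False, False]
```

Which one is right depends on whether "Home Office" should count as a match for "Office" in your data.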
You can define a separate function to do the check, for example:
import pandas as pd
data = {"properties": ["FinancialOffice", "Gas Station", "Office", "K-12 School", "Commercial, Office"]}
df = pd.DataFrame(data)
proplist = ["Office", "Other - Mall", "Gym"]
def check_present(cell):
    # split the cell on commas and compare each trimmed item with the list
    for word in (list(e.strip() for e in cell.split(','))):
        if word in proplist:
            return 'TRUE'
    return 'FALSE'
df['flag'] = df['properties'].apply(lambda x: check_present(x))
print(df)
Output:
properties flag
0 FinancialOffice FALSE
1 Gas Station FALSE
2 Office TRUE
3 K-12 School FALSE
4 Commercial, Office TRUE
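Note that the function above returns the strings 'TRUE' and 'FALSE', not real Booleans, so the column cannot be used directly as a mask. A minimal variant that returns actual bool values, and also produces the 0/1 form the question mentions, could look like the sketch below. The names `has_listed_property` and `flag_int` are illustrative only.

```python
import pandas as pd

df = pd.DataFrame({
    "properties": ["FinancialOffice", "Gas Station", "Office",
                   "K-12 School", "Commercial, Office"]
})
propset = {"Office", "Other - Mall", "Gym"}  # a set gives fast membership tests

def has_listed_property(cell: str) -> bool:
    # True if any comma-separated item, stripped of spaces, is in propset
    return any(part.strip() in propset for part in cell.split(","))

df["flag"] = df["properties"].apply(has_listed_property)
df["flag_int"] = df["flag"].astype(int)  # 0/1 version of the same flag
print(df)
```

With a real Boolean column you can filter directly, e.g. df[df["flag"]].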
You can use split() and strip() to convert each properties string of comma-delimited properties to a list of strings, then use the Python set intersection operator & to test whether any of the properties match those in proplist:
propset = set(proplist)
df['flag'] = (
    df.properties.str.split(',')
      .apply(lambda x: len({s.strip() for s in x} & propset) > 0)
)
Output:
properties flag
0 FinancialOffice False
1 Gas Station False
2 Office True
3 K-12 School False
4 Commercial, Office True
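If you would rather avoid apply with a Python lambda, the same split-and-compare idea can be written with pandas string methods plus explode() and isin(). This is a different technique from the set intersection above, shown only as a sketch; the intermediate name `items` is illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "properties": ["FinancialOffice", "Gas Station", "Office",
                   "K-12 School", "Commercial, Office"]
})
proplist = ["Office", "Other - Mall", "Gym"]

# One row per comma-separated item; the original row index is repeated
items = df["properties"].str.split(",").explode().str.strip()

# Test each item, then collapse back to one value per original row
df["flag"] = items.isin(proplist).groupby(level=0).any()
print(df)
```

The groupby(level=0).any() step is what turns "at least one item matches" into a single flag per row.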