String matching with dictionary keys in Python
Question:
I have a list of strings and a dictionary. For example:
l = ["apple fell on Newton", "lemon is yellow", "grass is greener"]
d = {"apple": "fruits", "lemon": "vegetable"}
The task is to match each string from the list against the keys of the dictionary. If a key matches, return the value of that key.
Currently, I am using the approach below, which is very time-consuming. Can someone please help me with a more efficient technique?
# For every dictionary key, check whether it appears as a word in the post, then collect the unique values
lmb_extract_type = (lambda post: list(filter(None, set(d.get(w) if w in post.lower().split() else None for w in d))))
df['type'] = df['sentences'].apply(lmb_extract_type)
Answers:
It is a single column with a string (e.g. "apple fell on Newton") in each row of the data frame. For each row, I have to match the string against the dictionary keys and return the value of the matching key.
The list has around 40-50 million elements, so it is taking a lot of time.
IIUC, based on your comments, you can solve this easily with str.extract and Series.replace, both of which are vectorized pandas methods that need no explicit Python loop.
- To use str.extract, build a regex pattern from the keys of the dictionary. It extracts only the first matching keyword (apple or lemon) from each row.
- You can then use the dictionary d to replace each extracted keyword directly with its corresponding value.
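import re
import pandas as pd

l = ["apple fell on Newton", "lemon is yellow", "grass is greener"]
d = {"apple": "fruits", "lemon": "vegetable"}
df = pd.DataFrame(l, columns=['sentences'])  # Single-column dataframe to demonstrate.
# Regex alternation of the (escaped) keys inside one capture group
pattern = '(' + '|'.join(map(re.escape, d)) + ')'
# expand=False returns a Series; replace then maps each extracted key to its value
df['type'] = df['sentences'].str.extract(pattern, expand=False).replace(d)
print(df)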
              sentences       type
0  apple fell on Newton     fruits
1       lemon is yellow  vegetable
2      grass is greener        NaN
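The pattern above matches keys anywhere in the string, so "apple" would also match inside "pineapple", and it is case-sensitive, whereas the code in the question lower-cases the text and compares whole words. A minimal sketch of a whole-word, case-insensitive variant follows; the sample sentences here are made up for illustration and are not from the question.

```python
import re
import pandas as pd

d = {"apple": "fruits", "lemon": "vegetable"}
# Hypothetical sample rows chosen to show the case and substring behaviour
df = pd.DataFrame({"sentences": ["Apple fell on Newton", "pineapple is sweet", "lemon is yellow"]})

# \b anchors each key to word boundaries; re.escape guards keys that contain regex metacharacters
pattern = r"\b(" + "|".join(map(re.escape, d)) + r")\b"

# Extract the first matching keyword per row, ignoring case (NaN when nothing matches)
matched = df["sentences"].str.extract(pattern, flags=re.IGNORECASE, expand=False)

# Lower-case the extracted keyword so it lines up with the dictionary keys, then look it up
df["type"] = matched.str.lower().map(d)
print(df)
```

Series.map is used here instead of replace because every extracted keyword is, by construction, either a dictionary key or missing. Like the original, this keeps only the first keyword found in each row.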
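Alternatively, apply a lambda function to each row and store the matching values as a comma-separated string in the dataframe. Note that this is a substring check against every key, so it still loops over the dictionary for each row.
df['New_Col'] = df['sentences'].apply(lambda s: ', '.join([value for key, value in d.items() if key in s]))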
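If every match per row is needed (as in the question's lambda) rather than only the first, one option is to flip the loop: split each sentence into words and look each word up in the dictionary, which costs one hash lookup per word instead of a scan over every key. This is only a sketch under the assumption that keys are single lower-case words; the helper name extract_types and the sample rows are hypothetical.

```python
import pandas as pd

d = {"apple": "fruits", "lemon": "vegetable"}
# Hypothetical sample rows; the second one contains two keys
df = pd.DataFrame({"sentences": ["apple fell on Newton", "lemon and apple pie", "grass is greener"]})

def extract_types(sentence):
    """Return the values of all dictionary keys that occur as words in the sentence."""
    words = sentence.lower().split()
    # dict.fromkeys drops duplicate values while keeping first-seen order
    return list(dict.fromkeys(d[w] for w in words if w in d))

# A plain list comprehension over the column avoids the per-row overhead of Series.apply
df["type"] = [extract_types(s) for s in df["sentences"]]
print(df)
```

Whether this beats the regex approach on 40-50 million rows depends on the number of keys and the sentence lengths, so it is worth timing both on a sample of the real data.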