How to create a new column containing a list of similar strings?
Question:
I've been trying to add a new column to a pandas dataframe that holds, for each row, a list of all the strings similar to that row's product name.
This is the original pandas dataframe:
import pandas as pd
d = {'product_name': ['2 pack liner socks', '2 pack logo liner socks', 'b.bare Hipster', 'Lady BARE Hipster Panty'], 'id': [13, 12, 11, 10]}
df = pd.DataFrame(data=d)
I would like to get a dataframe that looks like this:
# product_name # id # group
2 pack liner socks 13 ['2 pack liner socks', '2 pack logo liner socks']
2 pack logo liner socks 12 ['2 pack liner socks', '2 pack logo liner socks']
b.bare Hipster 11 ['b.bare Hipster', 'Lady BARE Hipster Panty']
Lady BARE Hipster Panty 10 ['b.bare Hipster', 'Lady BARE Hipster Panty']
I tried the following:
import thefuzz
from thefuzz import process
df["group"] = df["product_name"].apply(lambda x: process.extractOne(x, df["product_name"], scorer=fuzz.partial_ratio)[0])
It raises the following error:
NameError: name 'fuzz' is not defined
How can I fix this code, or is there another approach to solve this?
Answers:
You need to import fuzz: from thefuzz import process, fuzz. However, calling process.extractOne with all the values in product_name as the choices will always return the row's own value, because it is a 100% match. So filter it out of the choices with df["product_name"].loc[df['product_name'] != x]:
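from thefuzz import process, fuzz
# Pair each name with its best match among the other names; sorted() gives mutually matched rows the same list.
df['group'] = df["product_name"].apply(
    lambda x: sorted([x, process.extractOne(x, df["product_name"].loc[df['product_name'] != x],
                                            scorer=fuzz.partial_ratio)[0]]))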
product_name id group
0 2 pack liner socks 13 [2 pack liner socks, 2 pack logo liner socks]
1 2 pack logo liner socks 12 [2 pack liner socks, 2 pack logo liner socks]
2 b.bare Hipster 11 [Lady BARE Hipster Panty, b.bare Hipster]
3 Lady BARE Hipster Panty 10 [Lady BARE Hipster Panty, b.bare Hipster]
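Note that extractOne only ever returns a single best match, so each list above holds exactly two names. If a name can have several similar strings, another approach is to keep every name whose similarity reaches a cutoff. The following is only a rough sketch of that idea: it swaps in the standard library's difflib.SequenceMatcher for thefuzz (so the scores are not partial_ratio scores), and the function name similar_groups and the 0.6 cutoff are assumptions made for this example that you would need to tune for real data.

```python
from difflib import SequenceMatcher


def similar_groups(names, threshold=0.6):
    """For each name, list every name (itself included) that is similar enough.

    `similar_groups` and the default `threshold` are illustrative assumptions.
    """
    lowered = [n.lower() for n in names]  # compare case-insensitively
    groups = []
    for query in lowered:
        # Keep the original spelling of every name whose ratio reaches the cutoff.
        group = [
            original
            for original, candidate in zip(names, lowered)
            if SequenceMatcher(None, query, candidate).ratio() >= threshold
        ]
        groups.append(group)
    return groups


names = ['2 pack liner socks', '2 pack logo liner socks',
         'b.bare Hipster', 'Lady BARE Hipster Panty']
groups = similar_groups(names)
for name, group in zip(names, groups):
    print(name, '->', group)
```

With the dataframe from the question, df["group"] = similar_groups(df["product_name"].tolist()) should attach these lists as the new column. Each name is compared against every other name, so this is quadratic in the number of rows and is best suited to small tables.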