How to categorize a column with regex patterns
Question:
- My question is how to put a value in a new column, based on the content of another column.
- In my specific case, I have a dataframe with a column named 'Flop' that holds string values in 3 different categories.
- I can find these ‘categories’ with regex and, based on each category, I want to create another column called 'Suitedness' with the name of each category.
An example of my df is:
import pandas as pd
df = pd.DataFrame()
df['Flop']=['As 5d 7c','As 9s 3s','8c 7d 5s','8d, As, Js','Qs Ts 8d','7s 2s 2d']
Initial dataframe
Flop
As 5d 7c
As 9s 3s
8c 7d 5s
8d, As, Js
Qs Ts 8d
7s 2s 2d
I solved the problem this way:
Monotone = df[df['Flop'].str.contains(r'(\ws\s){2}\ws', na=False)]
Monotone['Suitedness'] = 'Monotone'
Rainbow = df[df['Flop'].str.contains(r'(\wc\s.*)+|(\w.\s\wc.*)+|(\w[s,d,c]\s\w[s,d,c]\s\wc)+', na=False)]
Rainbow['Suitedness'] = 'Rainbow'
DoubleSuited = df[df['Flop'].str.contains(r'((\ws\s){2}\w[d,c])+|(\ws\s\w[d,c]\s\ws)+|(\w[d,c]\s\ws\s\ws)+', na=False)]
DoubleSuited['Suitedness'] = 'Double Suited'
df2 = pd.concat([Monotone, Rainbow, DoubleSuited])
df2 = df2.sort_index()
- This code creates 3 different dataframes, and concatenates them.
- This solution works, but is inelegant.
- I’m looking for a cleaner solution.
- My regex syntax is also a little messy.
- The 3 categories are based on how many cards have the suit letter ‘s’: 1, 2 or 3.
- I’d also like tips on better regex syntax.
Final dataframe
Flop Suitedness
As 5d 7c Rainbow
As 9s 3s Monotone
8c 7d 5s Rainbow
Qs Ts 8d Double Suited
7s 2s 2d Double Suited
Answers:
- Using your sample data:
- This solution doesn’t alter the regular expressions being used; it only streamlines setting the 'Suitedness' of each string in 'Flop'.
- See the SO: Regex Tag Wiki for ideas to make the regular expressions more efficient.
- Visit regex101 to test your regular expressions.
- Create a dictionary with your regular expressions and associated phrases.
- Use pandas.Series.apply with a list comprehension, which returns a list with the correct Suitedness, or an empty list if there’s no match with re.match.
  - With the expectation that there will only be a single match, or no match, pandas.Series.explode is used to unwrap each list to its single value; an empty list becomes NaN.
  - A list index selection won’t work for cases where the list is empty (e.g. [][0]) because it results in an IndexError.
- If you are not concerned with NaN values, use df = df.dropna() to remove those rows.
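The difference between indexing with [0] and using explode can be seen on a toy Series. This is only an illustrative sketch: the Series values and the names `matches`, `raised` and `exploded` are made up here and are not part of the answer's code.

```python
import pandas as pd

# Toy Series of lists: one match, no match, one match (made-up values).
matches = pd.Series([['Rainbow'], [], ['Monotone']])

# Indexing each list with [0] fails as soon as an empty list is reached.
try:
    matches.apply(lambda labels: labels[0])
    raised = False
except IndexError:
    raised = True

# explode() unwraps single-element lists and turns empty lists into NaN,
# while keeping the original index labels.
exploded = matches.explode()
print(raised)
print(exploded)
```

If a string matched more than one pattern, explode would return several rows sharing the same index label, so this approach relies on the patterns being mutually exclusive.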
import pandas as pd
import re
# create a dict of mappings
mapping = {r'(\ws\s){2}\ws': 'Monotone',
           r'(\wc\s.*)+|(\w.\s\wc.*)+|(\w[s,d,c]\s\w[s,d,c]\s\wc)+': 'Rainbow',
           r'((\ws\s){2}\w[d,c])+|(\ws\s\w[d,c]\s\ws)+|(\w[d,c]\s\ws\s\ws)+': 'Double Suited'}
# apply a list comprehension
df['Suitedness'] = df.Flop.apply(lambda x: [v for k, v in mapping.items() if re.match(k, x)]).explode()
# display(df)
Flop Suitedness
As 5d 7c Rainbow
As 9s 3s Monotone
8c 7d 5s Rainbow
8d, As, Js NaN
Qs Ts 8d Double Suited
7s 2s 2d Double Suited
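On the request for simpler regex syntax, one possible alternative is to count the 's' suits directly instead of enumerating every card order. This is a rough sketch and not the method used in the answer above. It assumes the rule stated in the question (1, 2 or 3 cards with suit 's'), that only the suit letters s, d and c occur, and that a valid flop is exactly three cards separated by single spaces. The names `well_formed`, `spades` and `labels` are hypothetical.

```python
import pandas as pd

df = pd.DataFrame({'Flop': ['As 5d 7c', 'As 9s 3s', '8c 7d 5s',
                            '8d, As, Js', 'Qs Ts 8d', '7s 2s 2d']})

# Assumed format: three "rank + suit" cards separated by single spaces.
# Rows that do not fit (e.g. the comma-separated one) are left as NaN.
well_formed = df['Flop'].str.fullmatch(r'\w[sdc]( \w[sdc]){2}', na=False)

# Count the cards whose suit letter is 's'.
spades = df['Flop'].str.count(r'\ws\b')

# Map the count to a label, following the rule stated in the question.
labels = {1: 'Rainbow', 2: 'Double Suited', 3: 'Monotone'}
df['Suitedness'] = spades.map(labels).where(well_formed)
print(df)
```

On the sample data this gives the same labels as the dictionary approach, including NaN for '8d, As, Js'. A flop with no 's' at all would also come out as NaN under this rule.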