How to categorize a column with regex patterns

Question:

  • My question is how to put a value in a new column based on the content of another column.
  • In my specific case, I have a dataframe with a column named 'Flop' whose string values fall into 3 different categories.
  • I can find these ‘categories’ with regex, and based on each category, I want to create another column called 'Suitedness' holding the name of the category.

An example of my df is:

import pandas as pd
df = pd.DataFrame()
df['Flop']=['As 5d 7c','As 9s 3s','8c 7d 5s','8d, As, Js','Qs Ts 8d','7s 2s 2d']

Initial dataframe

       Flop
   As 5d 7c
   As 9s 3s
   8c 7d 5s
 8d, As, Js
   Qs Ts 8d
   7s 2s 2d

I solved the problem this way:

Monotone = df[df['Flop'].str.contains(r'(\ws\s){2}\ws',na=False)]
Monotone['Suitedness']= 'Monotone'
Rainbow = df[df['Flop'].str.contains(r'(\wc\s.*)+|(\w.\s\wc.*)+|(\w[s,d,c]\s\w[s,d,c]\s\wc)+',na=False)]
Rainbow['Suitedness']= 'Rainbow'
DoubleSuited = df[df['Flop'].str.contains(r'((\ws\s){2}\w[d,c])+|(\ws\s\w[d,c]\s\ws)+|(\w[d,c]\s\ws\s\ws)+',na=False)]
DoubleSuited['Suitedness']= 'Double Suited'
df2 = pd.concat([Monotone,Rainbow,DoubleSuited])
df2 = df2.sort_index()
  • This code creates 3 different dataframes, and concatenates them.
    • This solution works, but is inelegant.
    • I’m looking for a cleaner solution.
  • My regex syntax is also a little messy.
    • The 3 categories are based on how many times the letter ‘s’ appears: 1, 2 or 3 times.
    • I’d also like tips on better regex syntax.

Final dataframe

     Flop     Suitedness
 As 5d 7c        Rainbow
 As 9s 3s       Monotone
 8c 7d 5s        Rainbow
 Qs Ts 8d  Double Suited
 7s 2s 2d  Double Suited
Asked By: Vinicius Collaco


Answers:

  • Using your sample data
  • This solution doesn’t alter the regular expressions being used; it only streamlines setting the 'Suitedness' of each string in 'Flop'.
    • See the SO: Regex Tag Wiki for ideas to make the regular expressions more efficient
    • Visit regex101 to test your regular expressions.
  • Create a dictionary with your regular expressions and their associated phrases.
  • Use pandas.Series.apply with a list comprehension that tests each pattern with re.match; it returns a list containing the matching Suitedness, or an empty list if no pattern matches.
    • With the expectation that there will be either a single match or no match, pandas.Series.explode is used to unpack each one-element list into its value and each empty list into NaN.
      • A list index selection won’t work for cases where the list is empty (e.g. [][0]) because it results in an IndexError.
  • If you don’t need the unmatched rows (NaN in 'Suitedness'), use df = df.dropna() to remove them.
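The difference between explode and a plain index selection on an empty list can be checked in isolation. This is a minimal sketch with made-up per-row results; the names labels, exploded and raised are illustrative only and are not part of the solution code.

```python
import pandas as pd

# Hypothetical per-row results: one label, no label, one label
labels = pd.Series([['Rainbow'], [], ['Monotone']])

# explode unpacks one-element lists and turns the empty list into NaN
exploded = labels.explode()
print(exploded)

# Selecting index 0 directly fails on the empty list
raised = False
try:
    labels.apply(lambda x: x[0])
except IndexError:
    raised = True
print('IndexError raised:', raised)
```

Because every list here has at most one element, the exploded Series keeps the same length and index as the original, which is what lets it be assigned straight back to a column.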
import pandas as pd
import re

# create a dict of mappings
mapping = {r'(\ws\s){2}\ws': 'Monotone',
           r'(\wc\s.*)+|(\w.\s\wc.*)+|(\w[s,d,c]\s\w[s,d,c]\s\wc)+': 'Rainbow',
           r'((\ws\s){2}\w[d,c])+|(\ws\s\w[d,c]\s\ws)+|(\w[d,c]\s\ws\s\ws)+': 'Double Suited'}

# apply a list comprehension
df['Suitedness'] = df.Flop.apply(lambda x: [v for k, v in mapping.items() if re.match(k, x)]).explode()

# display(df)
       Flop     Suitedness
   As 5d 7c        Rainbow
   As 9s 3s       Monotone
   8c 7d 5s        Rainbow
 8d, As, Js            NaN
   Qs Ts 8d  Double Suited
   7s 2s 2d  Double Suited
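
If the categories really depend only on how many times the letter 's' appears, as the question states, the regular expressions could possibly be replaced by a count. The following is a rough sketch under that assumption, not the method used above; the names dictionary is a hypothetical helper.

```python
import pandas as pd

df = pd.DataFrame({'Flop': ['As 5d 7c', 'As 9s 3s', '8c 7d 5s',
                            '8d, As, Js', 'Qs Ts 8d', '7s 2s 2d']})

# Assumed rule, taken from the question: the number of 's' characters
# in the string decides the category
names = {1: 'Rainbow', 2: 'Double Suited', 3: 'Monotone'}

# Count the 's' characters per row, then translate each count to a name;
# a count that is not a key of `names` (e.g. 0) becomes NaN
df['Suitedness'] = df['Flop'].str.count('s').map(names)
print(df)
```

Note that this labels '8d, As, Js' as Double Suited, because the commas don't affect the count, whereas the regex version leaves that row as NaN. Which behavior is preferable depends on whether comma-separated rows should be treated as valid input.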
Answered By: Trenton McKinney