Splitting based on conditions


Say I have a DataFrame df as follows:

Red Motor
Green Taxi 
Light blue small Taxi  
Light blue big Taxi 

I would like to split the color and the vehicle into two columns. I used the command below to split off the last word, but sometimes there is a ‘big’ or ‘small’ associated with the vehicle name. How can I do the splitting with conditions?

df[['color','vehicle']] = df.myCol.str.rsplit(pat=' ', n=1, expand=True)

Asked By: test tes



I think the best approach is to use str.extract with a regex pattern.
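
A minimal sketch of that call, with the pattern assembled from the regex details listed below. The column name myCol comes from the question, and the sample rows are assumed to match the question's data.

```python
import pandas as pd

# Sample data assumed to mirror the four rows in the question
df = pd.DataFrame({
    "myCol": [
        "Red Motor",
        "Green Taxi",
        "Light blue small Taxi",
        "Light blue big Taxi",
    ]
})

# Group 1: lazily matched color; group 2: optional "small"/"big" plus the last word
pattern = r"^(.*?)\s((?:small|big)?\s?\w+)$"

# With two unnamed groups, extract returns a DataFrame with columns 0 and 1
result = df["myCol"].str.extract(pattern)
print(result)
```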


            0           1
0         Red       Motor
1       Green        Taxi
2  Light blue  small Taxi
3  Light blue    big Taxi

Regex details:

  • ^: Matches start of the string
  • (.*?): first capturing group
    • .*?: matches any character zero or more times but as few times as possible (lazy match)
  • \s: Matches a whitespace character
  • ((?:small|big)?\s?\w+): Second capturing group
    • (?:small|big)?: matches small or big zero or one time
    • \s?: matches whitespace zero or one time
    • \w+: matches word characters one or more times
  • $: matches end of the string

Series.str.extract is used here to extract two groups using a regular expression. The first group is the text before a whitespace character and the second group is the text after it; the second group may begin with the word "small" or "big". The call returns a new DataFrame with two columns containing the extracted groups.
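
If the goal is to write the groups straight into the color and vehicle columns, one option is to name the capturing groups so that extract labels the columns itself. This is a sketch rather than part of the original answer: the group names and the extra one-word row are assumptions, added to show that rows the pattern cannot match come back as NaN.

```python
import pandas as pd

# The last row is a hypothetical edge case with no whitespace at all
df = pd.DataFrame({"myCol": ["Red Motor", "Light blue big Taxi", "Taxi"]})

# Same pattern as above, but with named groups (names chosen for illustration)
pattern = r"^(?P<color>.*?)\s(?P<vehicle>(?:small|big)?\s?\w+)$"

# Named groups become the column names of the extracted DataFrame
parts = df["myCol"].str.extract(pattern)

# Attach the extracted columns to the original DataFrame
df[["color", "vehicle"]] = parts
print(df)
```

A row such as "Taxi" has no whitespace for \s to match, so both new columns are NaN for it; whether to fill those in or treat them as errors depends on the data.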

Answered By: Shubham Sharma

import pandas as pd

# create the dataframe
data = {'MyCol': ['Red Motor', 'Green Taxi', 'Light blue small Taxi', 'Light blue big Taxi']}
df = pd.DataFrame(data)

# create new columns for color and vehicle
df['color'] = ''
df['vehicle'] = ''

# iterate through rows of the dataframe
for i, row in df.iterrows():
    words = row['MyCol'].split()
    if len(words) >= 2 and words[-2] in ('big', 'small'):
        # second-to-last word is 'big' or 'small': keep it with the vehicle
        df.at[i, 'color'] = ' '.join(words[:-2])
        df.at[i, 'vehicle'] = words[-2] + ' ' + words[-1]
    else:
        # otherwise only the last word is the vehicle
        df.at[i, 'color'] = ' '.join(words[:-1])
        df.at[i, 'vehicle'] = words[-1]

# print the resulting dataframe
print(df)

This uses the str.split() method to split each string into words, then checks whether the second-to-last word is "big" or "small" and assigns the color and vehicle accordingly.

I could have done it with a regular expression, but that has some annoying edge cases.

Answered By: Suren
