Splitting based on conditions

Question:

Say I have df as follows:

MyCol
Red Motor
Green Taxi 
Light blue small Taxi  
Light blue big Taxi 

I would like to split the color and the vehicle into two columns. I used the command below to split off the last word, but sometimes the vehicle name is preceded by ‘big’ or ‘small’. How can I do the splitting with conditions?

df[['color','vehicle']] = df.MyCol.str.rsplit(pat=' ', n=1, expand=True)
Asked By: test tes

||

Answers:

I think the best approach is to use extract with a regex pattern

df['MyCol'].str.extract(r'^(.*?)\s((?:small|big)?\s?\w+)$')

            0           1
0         Red       Motor
1       Green        Taxi
2  Light blue  small Taxi
3  Light blue    big Taxi

Regex details:

  • ^: Matches start of the string
  • (.*?): first capturing group
    • .*?: matches any character zero or more times but as few times as possible (lazy match)
  • \s: matches a single whitespace character
  • ((?:small|big)?\s?\w+): second capturing group
    • (?:small|big)?: matches small or big zero or one time
    • \s?: matches a whitespace character zero or one time
    • \w+: matches word characters one or more times
  • $: matches end of the string
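
To see how the pieces of the pattern interact, here is a minimal sketch that applies the same regex with Python's standard re module to the sample strings from the question. The helper name split_color_vehicle is only an illustration and is not part of the original answer.

```python
import re

# Same pattern as in the answer, written as a raw string so the
# backslashes reach the regex engine unchanged.
PATTERN = re.compile(r'^(.*?)\s((?:small|big)?\s?\w+)$')

samples = [
    'Red Motor',
    'Green Taxi',
    'Light blue small Taxi',
    'Light blue big Taxi',
]


def split_color_vehicle(text):
    """Hypothetical helper: return (color, vehicle), or None if no match."""
    match = PATTERN.match(text)
    return match.groups() if match else None


results = [split_color_vehicle(s) for s in samples]
for sample, result in zip(samples, results):
    print(f'{sample!r} -> {result}')
```

Because the first group is lazy, it grows one character at a time until the rest of the pattern can match, which is why "Light blue" ends up in the color and "small Taxi" in the vehicle. A string with trailing whitespace, such as 'Green Taxi ', does not match this pattern, so stripping the values first may be needed.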

Series.str.extract is used here to extract two groups with a regular expression. The first group is the text before a whitespace character and the second group is the text after it; the second group may begin with the word "small" or "big". The method returns a new DataFrame with two columns containing the extracted groups.
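
If the goal is to get the result into columns named color and vehicle, as in the question, one possible variation is to use named groups, which str.extract turns into column names. This is a sketch under that assumption, not part of the original answer; the variable names pattern and parts are illustrative.

```python
import pandas as pd

df = pd.DataFrame({'MyCol': ['Red Motor', 'Green Taxi',
                             'Light blue small Taxi', 'Light blue big Taxi']})

# Named groups (?P<name>...) become the column names of the result.
pattern = r'^(?P<color>.*?)\s(?P<vehicle>(?:small|big)?\s?\w+)$'
parts = df['MyCol'].str.extract(pattern)

# Attach the extracted columns to the original frame (aligned on the index).
df = df.join(parts)
print(df)
```

Rows that do not match the pattern would get NaN in both new columns, so it is worth checking for missing values afterwards.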

Answered By: Shubham Sharma

import pandas as pd

# create the dataframe
data = {'MyCol': ['Red Motor', 'Green Taxi', 'Light blue small Taxi', 'Light blue big Taxi']}
df = pd.DataFrame(data)

# create new columns for color and vehicle
df['color'] = ''
df['vehicle'] = ''

# iterate through rows of the dataframe
for i, row in df.iterrows():
    words = row['MyCol'].split()
    if len(words) > 1 and words[-2] in ('big', 'small'):
        # size word precedes the vehicle: keep it with the vehicle name
        df.at[i, 'color'] = ' '.join(words[:-2])
        df.at[i, 'vehicle'] = words[-2] + ' ' + words[-1]
    else:
        # no size word: the vehicle is just the last word
        df.at[i, 'color'] = ' '.join(words[:-1])
        df.at[i, 'vehicle'] = words[-1]

# print the resulting dataframe
print(df)

This uses the str.split() method to split each string into words, then checks for the size word "big" or "small" and assigns the color and vehicle accordingly.
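
The word-based logic can also be packaged as a standalone function, which makes it easier to test on individual strings. The following is a minimal sketch, assuming the size word, when present, sits directly before the vehicle name as in the sample data; the names SIZE_WORDS and split_words are hypothetical.

```python
SIZE_WORDS = {'big', 'small'}  # assumed set of size qualifiers


def split_words(text):
    """Split 'color [size] vehicle' into a (color, vehicle) tuple."""
    words = text.split()
    # Keep two trailing words when the second-to-last one is a size word.
    take = 2 if len(words) >= 2 and words[-2] in SIZE_WORDS else 1
    return ' '.join(words[:-take]), ' '.join(words[-take:])


rows = ['Red Motor', 'Green Taxi', 'Light blue small Taxi', 'Light blue big Taxi']
pairs = [split_words(r) for r in rows]
for row, pair in zip(rows, pairs):
    print(row, '->', pair)
```

On a DataFrame, the same function could be applied per value, for example with df['MyCol'].map(split_words), which yields a Series of tuples.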

I could have done it with a regular expression, but that approach has some annoying edge cases.

Answered By: Suren