Splitting Pandas series with regex and word grouping constraints
Question:
Say I have df as follows:
MyCol
Red Motor
Blue Taxi
Green Taxi-1
Light blue small Taxi-1
Light blue big Taxi-2
I would like to split the color and the vehicle into two columns. I used this command to split the last word (could be any character).
The last word (could be any character, like taxi or taxi-1) refers to the vehicle. Sometimes, there is a ‘big’ or ‘small’ associated with the car name. The first few words (can be one or more than one words) refers to the color.
This is what I have tried. It only works when the last word is a word without special characters. How can I include the case when special characters in the last word too?
df['MyCol'].str.extract('^(.*?)s((?:small|big)?s?w+).*$')
Answers:
df['MyCol'].str.extract('^(.*?)s((?:small|big|)s?S+)$')
resulting in:
0
1
0
Red
Motor
1
Blue
Taxi
2
Green
Taxi-1
3
Light blue
small Taxi-1
4
Light blue
big Taxi
Say I have df as follows:
MyCol
Red Motor
Blue Taxi
Green Taxi-1
Light blue small Taxi-1
Light blue big Taxi-2
I would like to split the color and the vehicle into two columns. I used this command to split the last word (could be any character).
The last word (could be any character, like taxi or taxi-1) refers to the vehicle. Sometimes, there is a ‘big’ or ‘small’ associated with the car name. The first few words (can be one or more than one words) refers to the color.
This is what I have tried. It only works when the last word is a word without special characters. How can I include the case when special characters in the last word too?
df['MyCol'].str.extract('^(.*?)s((?:small|big)?s?w+).*$')
df['MyCol'].str.extract('^(.*?)s((?:small|big|)s?S+)$')
resulting in:
0 | 1 | |
---|---|---|
0 | Red | Motor |
1 | Blue | Taxi |
2 | Green | Taxi-1 |
3 | Light blue | small Taxi-1 |
4 | Light blue | big Taxi |