regex expression extract string after last comma with no numbers

Question

Given a dataframe A that looks like this:

id	information
001	Yellow, in town, John
002	Green, home, Lia 33
003	Yellow, garden, Peter2543
004	Red, 23 garden, 004 John891
005	Red, home, 245Sarah
006	Red 2, park 28, 67 Luke
007	Purple 03, to the beach, Mary Rose 9855
…	…

I want to create a new column called name by extracting the name from information, without numbers. That is:

id	information	name
001	Yellow, in town, John	John
002	Green, wardrobe, home, Lia 33	Lia
003	Yellow, garden, Peter2543	Peter
004	Red, 23 garden, 004 John891	John
005	Red, hat, home, 245Sarah	Sarah
006	Red 2, park 28, 67 Luke	Luke
007	Purple 03, to the beach, Mary Rose 9855	Mary Rose
…	…	…

Notice that:

the name in information is always after the last comma
the name may have numbers after and/or before the name
these numbers can be separated by a white space or not
the name can be two words (i.e. Mary Rose)

If I do:

A['name'] = A['information'].apply(lambda x: x.rsplit(',', 1)[1] if ',' in x else x)

it returns everything after the last comma (i.e: John, Lia 33, Peter 2543,…). But I need to only get the name.

I guess I have to use re.split() instead but I cannot figure out which should be the regex expression…

Asked By: LRD

||

Source

Answer 1

You can use

import pandas as pd
df = pd.DataFrame({"information":["Yellow, in town, John","Green, home, Lia 33","Yellow, garden, Peter2543","Red, 23 garden, 004 John891","Red, home, 245Sarah","Red 2, park 28, 67 Luke","Purple 03, to the beach, Mary Rose 9855"]})
df['name'] = df['information'].str.extract(r'.*,s*(?:d+s*)?([^d,]+?)(?:s*d+)?$', expand=False)

Output:

>>> df['information'].str.extract(r'.*,s*(?:d+s*)?([^d,]+?)(?:s*d+)?$', expand=False)
0         John
1          Lia
2        Peter
3         John
4        Sarah
5         Luke
6    Mary Rose
Name: information, dtype: object

Details:

.*, – any zero or more chars other than line break chars as many as possible, and then a , char
s* – zero or more whitespaces
(?:d+s*)? – an optional sequence of one or more digits and then zero or more whitespaces
([^d,]+?) – Group 1: one or more chars other than digits and comma, as few as possible
(?:s*d+)? – an optional sequence of zero or more whitespaces and then one or more digits
$ – end of string.

Answered By: Wiktor Stribiżew

regex expression extract string after last comma with no numbers

Question:

Answers: