regex expression extract string after last comma with no numbers

Question:

Given a dataframe A that looks like this:

id information
001 Yellow, in town, John
002 Green, home, Lia 33
003 Yellow, garden, Peter2543
004 Red, 23 garden, 004 John891
005 Red, home, 245Sarah
006 Red 2, park 28, 67 Luke
007 Purple 03, to the beach, Mary Rose 9855

I want to create a new column called name by extracting the name from information, without numbers. That is:

id information name
001 Yellow, in town, John John
002 Green, wardrobe, home, Lia 33 Lia
003 Yellow, garden, Peter2543 Peter
004 Red, 23 garden, 004 John891 John
005 Red, hat, home, 245Sarah Sarah
006 Red 2, park 28, 67 Luke Luke
007 Purple 03, to the beach, Mary Rose 9855 Mary Rose

Notice that:

  • the name in information is always after the last comma
  • the name may have numbers after and/or before the name
  • these numbers can be separated by a white space or not
  • the name can be two words (i.e. Mary Rose)

If I do:

A['name'] = A['information'].apply(lambda x: x.rsplit(',', 1)[1] if ',' in x else x)

it returns everything after the last comma (i.e: John, Lia 33, Peter 2543,…). But I need to only get the name.

I guess I have to use re.split() instead but I cannot figure out which should be the regex expression…

Asked By: LRD

||

Answers:

You can use

import pandas as pd
df = pd.DataFrame({"information":["Yellow, in town, John","Green, home, Lia 33","Yellow, garden, Peter2543","Red, 23 garden, 004 John891","Red, home, 245Sarah","Red 2, park 28, 67 Luke","Purple 03, to the beach, Mary Rose 9855"]})
df['name'] = df['information'].str.extract(r'.*,s*(?:d+s*)?([^d,]+?)(?:s*d+)?$', expand=False)

Output:

>>> df['information'].str.extract(r'.*,s*(?:d+s*)?([^d,]+?)(?:s*d+)?$', expand=False)
0         John
1          Lia
2        Peter
3         John
4        Sarah
5         Luke
6    Mary Rose
Name: information, dtype: object

Details:

  • .*, – any zero or more chars other than line break chars as many as possible, and then a , char
  • s* – zero or more whitespaces
  • (?:d+s*)? – an optional sequence of one or more digits and then zero or more whitespaces
  • ([^d,]+?) – Group 1: one or more chars other than digits and comma, as few as possible
  • (?:s*d+)? – an optional sequence of zero or more whitespaces and then one or more digits
  • $ – end of string.
Answered By: Wiktor Stribiżew
Categories: questions Tags: , ,
Answers are sorted by their score. The answer accepted by the question owner as the best is marked with
at the top-right corner.