regex expression extract string after last comma with no numbers
Question:
Given a dataframe A
that looks like this:
id
information
001
Yellow, in town, John
002
Green, home, Lia 33
003
Yellow, garden, Peter2543
004
Red, 23 garden, 004 John891
005
Red, home, 245Sarah
006
Red 2, park 28, 67 Luke
007
Purple 03, to the beach, Mary Rose 9855
…
…
I want to create a new column called name
by extracting the name from information
, without numbers. That is:
id
information
name
001
Yellow, in town, John
John
002
Green, wardrobe, home, Lia 33
Lia
003
Yellow, garden, Peter2543
Peter
004
Red, 23 garden, 004 John891
John
005
Red, hat, home, 245Sarah
Sarah
006
Red 2, park 28, 67 Luke
Luke
007
Purple 03, to the beach, Mary Rose 9855
Mary Rose
…
…
…
Notice that:
- the name in
information
is always after the last comma
- the name may have numbers after and/or before the name
- these numbers can be separated by a white space or not
- the name can be two words (i.e. Mary Rose)
If I do:
A['name'] = A['information'].apply(lambda x: x.rsplit(',', 1)[1] if ',' in x else x)
it returns everything after the last comma (i.e: John, Lia 33, Peter 2543,…). But I need to only get the name.
I guess I have to use re.split()
instead but I cannot figure out which should be the regex expression…
Answers:
You can use
import pandas as pd
df = pd.DataFrame({"information":["Yellow, in town, John","Green, home, Lia 33","Yellow, garden, Peter2543","Red, 23 garden, 004 John891","Red, home, 245Sarah","Red 2, park 28, 67 Luke","Purple 03, to the beach, Mary Rose 9855"]})
df['name'] = df['information'].str.extract(r'.*,s*(?:d+s*)?([^d,]+?)(?:s*d+)?$', expand=False)
Output:
>>> df['information'].str.extract(r'.*,s*(?:d+s*)?([^d,]+?)(?:s*d+)?$', expand=False)
0 John
1 Lia
2 Peter
3 John
4 Sarah
5 Luke
6 Mary Rose
Name: information, dtype: object
Details:
.*,
– any zero or more chars other than line break chars as many as possible, and then a ,
char
s*
– zero or more whitespaces
(?:d+s*)?
– an optional sequence of one or more digits and then zero or more whitespaces
([^d,]+?)
– Group 1: one or more chars other than digits and comma, as few as possible
(?:s*d+)?
– an optional sequence of zero or more whitespaces and then one or more digits
$
– end of string.
Given a dataframe A
that looks like this:
id | information |
---|---|
001 | Yellow, in town, John |
002 | Green, home, Lia 33 |
003 | Yellow, garden, Peter2543 |
004 | Red, 23 garden, 004 John891 |
005 | Red, home, 245Sarah |
006 | Red 2, park 28, 67 Luke |
007 | Purple 03, to the beach, Mary Rose 9855 |
… | … |
I want to create a new column called name
by extracting the name from information
, without numbers. That is:
id | information | name |
---|---|---|
001 | Yellow, in town, John | John |
002 | Green, wardrobe, home, Lia 33 | Lia |
003 | Yellow, garden, Peter2543 | Peter |
004 | Red, 23 garden, 004 John891 | John |
005 | Red, hat, home, 245Sarah | Sarah |
006 | Red 2, park 28, 67 Luke | Luke |
007 | Purple 03, to the beach, Mary Rose 9855 | Mary Rose |
… | … | … |
Notice that:
- the name in
information
is always after the last comma - the name may have numbers after and/or before the name
- these numbers can be separated by a white space or not
- the name can be two words (i.e. Mary Rose)
If I do:
A['name'] = A['information'].apply(lambda x: x.rsplit(',', 1)[1] if ',' in x else x)
it returns everything after the last comma (i.e: John, Lia 33, Peter 2543,…). But I need to only get the name.
I guess I have to use re.split()
instead but I cannot figure out which should be the regex expression…
You can use
import pandas as pd
df = pd.DataFrame({"information":["Yellow, in town, John","Green, home, Lia 33","Yellow, garden, Peter2543","Red, 23 garden, 004 John891","Red, home, 245Sarah","Red 2, park 28, 67 Luke","Purple 03, to the beach, Mary Rose 9855"]})
df['name'] = df['information'].str.extract(r'.*,s*(?:d+s*)?([^d,]+?)(?:s*d+)?$', expand=False)
Output:
>>> df['information'].str.extract(r'.*,s*(?:d+s*)?([^d,]+?)(?:s*d+)?$', expand=False)
0 John
1 Lia
2 Peter
3 John
4 Sarah
5 Luke
6 Mary Rose
Name: information, dtype: object
Details:
.*,
– any zero or more chars other than line break chars as many as possible, and then a,
chars*
– zero or more whitespaces(?:d+s*)?
– an optional sequence of one or more digits and then zero or more whitespaces([^d,]+?)
– Group 1: one or more chars other than digits and comma, as few as possible(?:s*d+)?
– an optional sequence of zero or more whitespaces and then one or more digits$
– end of string.