How to extract last name while avoiding roman numerals
Question:
How can I extract only the last name (including hyphenated double last names), without Roman numerals, extra spaces, or other characters?
A string in a pandas DataFrame column representing a person's full name can take the following forms:
Jon Doe
Jon A. Doe
Jon Anderson Doe
Jon Doe II
Jon Doe, IV
Jon A. Doe, V
Jon A. Doe X
Jon Anderson Doe, VI
Jon Anderson Doe VII
Jon Anderson Doe-Stapleton VII
Jon Anderson Doe-Stapleton, VII
Jon Anderson Doe-Stapleton
Is regex a good solution? I’m obviously a novice, but would like an efficient solution.
Thank you for your help!
Answers:
Try this to remove the roman numerals and comma:
import re
x = """Jon Doe
Jon A. Doe
Jon Anderson Doe
Jon Doe II
Jon Doe, IV
Jon A. Doe, V
Jon A. Doe X
Jon Anderson Doe, VI
Jon Anderson Doe VII
Jon Anderson Doe-Stapleton VII
Jon Anderson Doe-Stapleton, VII
Jon Anderson Doe-Stapleton""".split('\n')
for s in x:
    print(re.sub(r"(,.)?(M{0,4}(CM|CD|D?C{0,3})(XC|XL|L?X{0,3})(IX|IV|V?I{0,3}))$", "", s))
[out]:
Jon Doe
Jon A. Doe
Jon Anderson Doe
Jon Doe
Jon Doe
Jon A. Doe
Jon A. Doe
Jon Anderson Doe
Jon Anderson Doe
Jon Anderson Doe-Stapleton
Jon Anderson Doe-Stapleton
Jon Anderson Doe-Stapleton
Regex explanation: https://regex101.com/r/xeZpBD/1
Why do you need a complex regex for the Roman numerals?
See https://regexr.com/3a406, because not every sequence of the letters IVXLCDM is a valid Roman numeral.
- https://www.geeksforgeeks.org/validating-roman-numerals-using-regular-expression/
- https://www.oreilly.com/library/view/regular-expressions-cookbook/9780596802837/ch06s09.html
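To make the difference concrete, here is a minimal sketch comparing a loose character-class check with the stricter pattern used above. The function names `is_roman_loose` and `is_roman_strict` are made up for this illustration, and the lookahead `(?=[IVXLCDM])` is an addition so that the strict pattern does not accept an empty string.

```python
import re

# Loose check: any run of the seven Roman-numeral letters.
_LOOSE_RE = re.compile(r"[IVXLCDM]+")

# Strict check: same structure as the pattern in the answer above.
# The lookahead is an addition that rejects the empty string.
_STRICT_RE = re.compile(
    r"(?=[IVXLCDM])M{0,4}(CM|CD|D?C{0,3})(XC|XL|L?X{0,3})(IX|IV|V?I{0,3})"
)


def is_roman_loose(token: str) -> bool:
    """True if the token consists only of Roman-numeral letters."""
    return _LOOSE_RE.fullmatch(token) is not None


def is_roman_strict(token: str) -> bool:
    """True if the token is a well-formed Roman numeral."""
    return _STRICT_RE.fullmatch(token) is not None


if __name__ == "__main__":
    for token in ["IV", "VII", "IIII", "VX", "DIM"]:
        print(token, is_roman_loose(token), is_roman_strict(token))
```

Tokens such as IIII, VX and DIM pass the loose check but fail the strict one. Note that an all-caps surname that happens to be a valid numeral (MIX, for example) would still be treated as a suffix by either check.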
But how do we extract the last name?
Depends on how it’s defined. If it’s just the last token from the names, then you can just do this:
for s in x:
    print(re.sub(r"(,.)?(M{0,4}(CM|CD|D?C{0,3})(XC|XL|L?X{0,3})(IX|IV|V?I{0,3}))$", "", s).strip().split(' ')[-1])
[out]:
Doe
Doe
Doe
Doe
Doe
Doe
Doe
Doe
Doe
Doe-Stapleton
Doe-Stapleton
Doe-Stapleton
What if last name isn’t a single token/word?
E.g. https://en.wikipedia.org/wiki/Double-barrelled_name
The rugby player Rohan Janse van Rensburg's surname is Janse van Rensburg, not only van Rensburg (which is itself an existing surname).
or
Andrew Lloyd Webber, Baron Lloyd-Webber Kt (born 22 March 1948), is an English composer and impresario of musical theatre.
Shrugs, you need something more than regex for this, maybe a last name list?
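A last name list could look something like the sketch below. Everything here is an assumption for illustration: `KNOWN_MULTIWORD_SURNAMES` is a hypothetical hand-made list (a real one would have to come from your own data), and `last_name_with_list` is just a made-up helper name. It strips a trailing Roman-numeral suffix first, then prefers the longest known surname that the name ends with, and falls back to the last token.

```python
import re

# Hypothetical lookup list; in practice this would come from your own data.
KNOWN_MULTIWORD_SURNAMES = ["Janse van Rensburg", "van Rensburg", "Lloyd Webber"]

# Optional comma, whitespace, then a well-formed Roman numeral at the end.
_SUFFIX_RE = re.compile(
    r",?\s+(?=[IVXLCDM])M{0,4}(CM|CD|D?C{0,3})(XC|XL|L?X{0,3})(IX|IV|V?I{0,3})$"
)


def last_name_with_list(name, surnames=KNOWN_MULTIWORD_SURNAMES):
    """Return the surname, preferring the longest known multi-word match."""
    base = _SUFFIX_RE.sub("", name.strip())
    # Longest first, so "Janse van Rensburg" wins over "van Rensburg".
    for surname in sorted(surnames, key=len, reverse=True):
        if base.endswith(" " + surname):
            return surname
    # Fallback: treat the last whitespace-separated token as the surname.
    return base.split(" ")[-1]


if __name__ == "__main__":
    for full_name in ["Rohan Janse van Rensburg", "Andrew Lloyd Webber", "Jon Doe, IV"]:
        print(full_name, "->", last_name_with_list(full_name))
```

This only works as well as the list behind it; a name missing from the list silently falls back to the last token.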
An optimized approach with the re module:
import re
strings = """
Jon Doe
Jon A. Doe
Jon Anderson Doe
Jon Doe II
Jon Doe, IV
Jon A. Doe, V
Jon A. Doe X
Jon Anderson Doe, VI
Jon Anderson Doe VII
Jon Anderson Doe-Stapleton VII
Jon Anderson Doe-Stapleton, VII
Jon Anderson Doe-Stapleton
"""
_ROMAN_NUMERAL_RE = re.compile(r'^[IVXLCDM]+$')
for name in strings.strip().splitlines():
    words = name.rsplit(' ', maxsplit=2)
    last_name = words[-2].rstrip(',') if _ROMAN_NUMERAL_RE.match(last := words[-1]) else last
    print(f'{name.ljust(35)} - [ {last_name} ]')
Result:
Jon Doe - [ Doe ]
Jon A. Doe - [ Doe ]
Jon Anderson Doe - [ Doe ]
Jon Doe II - [ Doe ]
Jon Doe, IV - [ Doe ]
Jon A. Doe, V - [ Doe ]
Jon A. Doe X - [ Doe ]
Jon Anderson Doe, VI - [ Doe ]
Jon Anderson Doe VII - [ Doe ]
Jon Anderson Doe-Stapleton VII - [ Doe-Stapleton ]
Jon Anderson Doe-Stapleton, VII - [ Doe-Stapleton ]
Jon Anderson Doe-Stapleton - [ Doe-Stapleton ]
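One edge case worth noting: the snippet above indexes `words[-2]`, so a one-word input that looks like a numeral (for example just "X") would raise an IndexError. Below is a hedged variant that guards against that and uses the stricter numeral pattern. The name `extract_last_name` is made up for this sketch, and it has not been benchmarked against the versions in this thread.

```python
import re

# Well-formed Roman numerals only; the lookahead rejects the empty string.
_STRICT_ROMAN_RE = re.compile(
    r"(?=[IVXLCDM])M{0,4}(CM|CD|D?C{0,3})(XC|XL|L?X{0,3})(IX|IV|V?I{0,3})"
)


def extract_last_name(name: str) -> str:
    """Return the last token of a name, ignoring a trailing Roman numeral."""
    words = name.split()
    # Drop the numeral only if at least one other word remains before it.
    if len(words) > 1 and _STRICT_ROMAN_RE.fullmatch(words[-1]):
        words.pop()
    # Remove the comma left behind by forms like "Doe, IV".
    return words[-1].rstrip(",") if words else ""


if __name__ == "__main__":
    for full_name in ["Jon Doe", "Jon A. Doe, V", "Jon Anderson Doe-Stapleton VII", "X"]:
        print(f"{full_name!r} -> {extract_last_name(full_name)!r}")
```

With a DataFrame, a function like this could be applied with something like df["name"].map(extract_last_name), where "name" is a hypothetical column name.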
Performance
This version is actually much faster (more than 5x, and roughly 8.7x in the run shown below) than the more complex approach with re.
See the benchmark tests with timeit below:
import re
from timeit import timeit
strings = """
Jon Doe
Jon A. Doe
Jon Anderson Doe
Jon Doe II
Jon Doe, IV
Jon A. Doe, V
Jon A. Doe X
Jon Anderson Doe, VI
Jon Anderson Doe VII
Jon Anderson Doe-Stapleton VII
Jon Anderson Doe-Stapleton, VII
Jon Anderson Doe-Stapleton
"""
_ROMAN_NUMERAL_RE = re.compile(r'^[IVXLCDM]+$')
def last_name(name: str):
    words = name.rsplit(' ', maxsplit=2)
    return words[-2].rstrip(',') if _ROMAN_NUMERAL_RE.match(last := words[-1]) else last

def last_name_re(name: str):
    return re.sub(r"(,.)?(M{0,4}(CM|CD|D?C{0,3})(XC|XL|L?X{0,3})(IX|IV|V?I{0,3}))$", "", name).strip().split(' ')[-1]

# for name in strings.strip().splitlines():
#     print(f'{name.ljust(35)} - [ {last_name_re(name)} ]')

n = 100_000
print('re: ', timeit("""
for name in strings.strip().splitlines():
    last_name(name)
""", globals=globals(), number=n))
print('re (complex): ', timeit("""
for name in strings.strip().splitlines():
    last_name_re(name)
""", globals=globals(), number=n))
Results on my Mac:
re: 0.42575412476435304
re (complex): 3.725017166696489