How to extract last name while avoiding roman numerals

Question

How can I extract only the last name (including hyphenated double last names), without Roman numerals, extra spaces, or other characters?

A string in a Pandas DataFrame representing a person’s full name can take the following forms:

Jon Doe
Jon A. Doe
Jon Anderson Doe
Jon Doe II
Jon Doe, IV
Jon A. Doe, V
Jon A. Doe X
Jon Anderson Doe, VI
Jon Anderson Doe VII
Jon Anderson Doe-Stapleton VII
Jon Anderson Doe-Stapleton, VII
Jon Anderson Doe-Stapleton

Is regex a good solution? I’m obviously a novice, but would like an efficient solution.

Thank you for your help!

Asked By: Rycliff

Answer 1

Try this to remove the roman numerals and comma:

import re

x = """Jon Doe
Jon A. Doe
Jon Anderson Doe
Jon Doe II
Jon Doe, IV
Jon A. Doe, V
Jon A. Doe X
Jon Anderson Doe, VI
Jon Anderson Doe VII
Jon Anderson Doe-Stapleton VII
Jon Anderson Doe-Stapleton, VII
Jon Anderson Doe-Stapleton""".split('\n')


for s in x:
  print(re.sub(r"(,.)?(M{0,4}(CM|CD|D?C{0,3})(XC|XL|L?X{0,3})(IX|IV|V?I{0,3}))$", "", s))

[out]:

Jon Doe
Jon A. Doe
Jon Anderson Doe
Jon Doe 
Jon Doe
Jon A. Doe
Jon A. Doe 
Jon Anderson Doe
Jon Anderson Doe 
Jon Anderson Doe-Stapleton 
Jon Anderson Doe-Stapleton
Jon Anderson Doe-Stapleton

Regex explanation: https://regex101.com/r/xeZpBD/1

Why do you need a complex regex for the Roman numerals?

Because not every combination of the letters IVXLCDM is a valid Roman numeral; see https://regexr.com/3a406.
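To make the difference concrete, here is a small sketch comparing the strict pattern above with a plain character class. The helper names is_strict_numeral and is_loose_numeral are made up for this illustration. Note that every part of the strict pattern is optional, so it also matches the empty string, which the helper rules out explicitly.

```python
import re

# Strict pattern from the answer above: only well-formed numerals.
# Every group is optional, so on its own it also matches "".
STRICT = re.compile(r"M{0,4}(CM|CD|D?C{0,3})(XC|XL|L?X{0,3})(IX|IV|V?I{0,3})")
# Loose check: any run of numeral letters, well-formed or not.
LOOSE = re.compile(r"[IVXLCDM]+")


def is_strict_numeral(token: str) -> bool:
    # Reject the empty string explicitly, then require a full match.
    return bool(token) and STRICT.fullmatch(token) is not None


def is_loose_numeral(token: str) -> bool:
    return LOOSE.fullmatch(token) is not None


for token in ["VII", "IV", "IIII", "VX", "DIM"]:
    print(token, is_strict_numeral(token), is_loose_numeral(token))
```

IIII, VX and DIM pass the loose check but fail the strict one, while VII and IV pass both.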

But how do we extract the last name?

Depends on how it’s defined. If it’s just the last token from the names, then you can just do this:

for s in x:
  print(re.sub(r"(,.)?(M{0,4}(CM|CD|D?C{0,3})(XC|XL|L?X{0,3})(IX|IV|V?I{0,3}))$", "", s).strip().split(' ')[-1])

[out]:

Doe
Doe
Doe
Doe
Doe
Doe
Doe
Doe
Doe
Doe-Stapleton
Doe-Stapleton
Doe-Stapleton

What if the last name isn’t a single token/word?

E.g. https://en.wikipedia.org/wiki/Double-barrelled_name

The rugby player Rohan Janse van Rensburg’s surname is Janse van Rensburg, not only van Rensburg (which is itself an existing surname).

or

Andrew Lloyd Webber, Baron Lloyd-Webber Kt (born 22 March 1948), is an English composer and impresario of musical theatre.

Shrugs, you need something more than regex for this, maybe a last name list?
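A rough sketch of what the list idea could look like, assuming the Roman-numeral suffix has already been stripped. KNOWN_SURNAMES and last_name_from_list are hypothetical names invented here; a real list would have to come from a curated surname dataset. The function tries the longest trailing run of words first, so a multi-word surname wins over a shorter one contained in it.

```python
# Hypothetical lookup table, lower-cased for case-insensitive matching.
KNOWN_SURNAMES = {"janse van rensburg", "van rensburg", "lloyd webber", "doe"}


def last_name_from_list(name: str, surnames=KNOWN_SURNAMES) -> str:
    tokens = name.split()
    # Start at 1 so the first token is always kept as a given name,
    # and go from the longest candidate suffix to the shortest.
    for start in range(1, len(tokens)):
        candidate = " ".join(tokens[start:])
        if candidate.lower() in surnames:
            return candidate
    # Nothing in the list matched: fall back to the last token.
    return tokens[-1] if tokens else ""


print(last_name_from_list("Rohan Janse van Rensburg"))
print(last_name_from_list("Andrew Lloyd Webber"))
print(last_name_from_list("Jon Anderson Doe"))
```

This only helps for surnames that are actually in the list; anything else falls back to the last-token rule.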

Answered By: alvas

Answer 2

An optimized approach with the re module:

import re

strings = """
Jon Doe
Jon A. Doe
Jon Anderson Doe
Jon Doe II
Jon Doe, IV
Jon A. Doe, V
Jon A. Doe X
Jon Anderson Doe, VI
Jon Anderson Doe VII
Jon Anderson Doe-Stapleton VII
Jon Anderson Doe-Stapleton, VII
Jon Anderson Doe-Stapleton
"""

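# Loose check: any run of Roman-numeral letters counts as a suffix.
_ROMAN_NUMERAL_RE = re.compile(r'^[IVXLCDM]+$')

for name in strings.strip().splitlines():
    # Split off at most the last two words: [rest, last name, possible suffix].
    words = name.rsplit(' ', maxsplit=2)
    # The walrus operator (:=) requires Python 3.8+.
    last_name = words[-2].rstrip(',') if _ROMAN_NUMERAL_RE.match(last := words[-1]) else last
    print(f'{name.ljust(35)} - [ {last_name} ]')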

Result:

Jon Doe                             - [ Doe ]
Jon A. Doe                          - [ Doe ]
Jon Anderson Doe                    - [ Doe ]
Jon Doe II                          - [ Doe ]
Jon Doe, IV                         - [ Doe ]
Jon A. Doe, V                       - [ Doe ]
Jon A. Doe X                        - [ Doe ]
Jon Anderson Doe, VI                - [ Doe ]
Jon Anderson Doe VII                - [ Doe ]
Jon Anderson Doe-Stapleton VII      - [ Doe-Stapleton ]
Jon Anderson Doe-Stapleton, VII     - [ Doe-Stapleton ]
Jon Anderson Doe-Stapleton          - [ Doe-Stapleton ]
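If the real data is less tidy than the sample (repeated spaces, a stray comma, a one-word entry), a slightly more defensive variant might look like the sketch below. The function name extract_last_name is made up for this example, and it keeps the same loose numeral check, so a surname written entirely in numeral letters would still be treated as a suffix.

```python
import re

# Same loose check as above; swap in a stricter pattern if needed.
_ROMAN_NUMERAL_RE = re.compile(r"[IVXLCDM]+")


def extract_last_name(name: str) -> str:
    # Turn commas into spaces, then split on any run of whitespace.
    tokens = name.replace(",", " ").split()
    # Drop one trailing numeral, but only if a word remains in front of it.
    if len(tokens) > 1 and _ROMAN_NUMERAL_RE.fullmatch(tokens[-1]):
        tokens.pop()
    return tokens[-1] if tokens else ""


names = ["Jon Doe", "Jon A. Doe, V", "Jon Anderson Doe-Stapleton, VII", "  Jon Doe  II "]
print([extract_last_name(n) for n in names])
```

With pandas, and assuming the names live in a column called name of a DataFrame df (both names are placeholders), something like df["name"].map(extract_last_name) would apply it to every row.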

Performance

This version is much faster than the more complex re approach from Answer 1: roughly 8 to 9 times faster in the timings below, which come from a single machine.

See benchmark tests with timeit below:

import re
from timeit import timeit

strings = """
Jon Doe
Jon A. Doe
Jon Anderson Doe
Jon Doe II
Jon Doe, IV
Jon A. Doe, V
Jon A. Doe X
Jon Anderson Doe, VI
Jon Anderson Doe VII
Jon Anderson Doe-Stapleton VII
Jon Anderson Doe-Stapleton, VII
Jon Anderson Doe-Stapleton
"""

_ROMAN_NUMERAL_RE = re.compile(r'^[IVXLCDM]+$')


def last_name(name: str):
    words = name.rsplit(' ', maxsplit=2)
    return words[-2].rstrip(',') if _ROMAN_NUMERAL_RE.match(last := words[-1]) else last


def last_name_re(name: str):
    return re.sub(r"(,.)?(M{0,4}(CM|CD|D?C{0,3})(XC|XL|L?X{0,3})(IX|IV|V?I{0,3}))$", "", name).strip().split(' ')[-1]


# for name in strings.strip().splitlines():
#     print(f'{name.ljust(35)} - [ {last_name_re(name)} ]')

n = 100_000

print('re:            ', timeit("""
for name in strings.strip().splitlines():
    last_name(name)
""", globals=globals(), number=n))

print('re (complex):  ', timeit("""
for name in strings.strip().splitlines():
    last_name_re(name)
""", globals=globals(), number=n))

Results on my Mac:

re:             0.42575412476435304
re (complex):   3.725017166696489

Answered By: rv.kvetch