How to separate the part of a string that is all upper case?

Question:

Below is a sample dataset in HTML:

<div class="leader-info"><h4>Director of IR</h4><p>Diane PHILIPS</p></div>,
<div class="leader-info"><h4>Director of Finance</h4><p>Nancy LOPEZ</p></div>,
<div class="leader-info"><h4>Director of HR</h4><p>George SANTOZ</p></div>,
<div class="leader-info"><h4>Director of </h4><p>KUMBARO FURXHI Mirela</p></div>

I used BeautifulSoup to extract the data and joined the h4 and p text with a pipe delimiter:

for leader_list in soup.findAll(attrs={'class':'leader-info'}):
    print(leader_list.get_text(strip=True, separator='|'))

However, I want to separate the given name and the surname within the p tag. The surname is in all caps and can be at the beginning or at the end of the string. It can also consist of multiple words separated by spaces. How would I go about transforming the output into the following?

Director of IR|Diane|PHILIPS
Director of Finance|Nancy|LOPEZ
Director of HR|George|SANTOZ
Director of HR|KUMBARO FURXHI|Mirela
Asked By: John Al

||

Answers:

In new code, avoid the old findAll() syntax; use find_all() or select() with CSS selectors instead. For more, take a minute to check the docs.
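
As a rough sketch of the two newer spellings, assuming bs4 is installed together with its CSS selector support, both calls below should pick up the same elements. The two-row html string is shortened sample data from the question, and the variable names are only illustrative:

```python
from bs4 import BeautifulSoup

# Shortened sample data from the question (two rows only).
html = '''
<div class="leader-info"><h4>Director of IR</h4><p>Diane PHILIPS</p></div>,
<div class="leader-info"><h4>Director of Finance</h4><p>Nancy LOPEZ</p></div>
'''
soup = BeautifulSoup(html, 'html.parser')

# find_all() with the class_ keyword argument
by_find_all = [e.p.get_text() for e in soup.find_all(class_='leader-info')]

# select() with a CSS class selector
by_select = [e.p.get_text() for e in soup.select('.leader-info')]

print(by_find_all)
print(by_select)
```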


You could check whether each word in the string .isupper(); in addition, there is .capitalize() to "normalize" the last_name, if needed:

'last_name':' '.join([s.capitalize() for s in e.p.text.split() if s.isupper()])
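
The same check can be sketched in isolation, without BeautifulSoup. The helper name split_name is hypothetical and not part of any library; this version keeps the surname in upper case, as in the question's expected output, by leaving out .capitalize():

```python
def split_name(full_name):
    """Split a name into (given, surname); all-caps words count as the surname."""
    words = full_name.split()
    given = ' '.join(w for w in words if not w.isupper())
    surname = ' '.join(w for w in words if w.isupper())
    return given, surname


# The surname may come first or last, and may span several words.
for name in ['Diane PHILIPS', 'KUMBARO FURXHI Mirela']:
    given, surname = split_name(name)
    print(f'{given}|{surname}')
```

One caveat with this approach: an initial such as 'J.' also passes .isupper(), so it would be grouped with the surname.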

I would recommend storing your results in a more structured way instead of just printing them.

Example
from bs4 import BeautifulSoup

html = '''
<div class="leader-info"><h4>Director of IR</h4><p>Diane PHILIPS</p></div>,
<div class="leader-info"><h4>Director of Finance</h4><p>Nancy LOPEZ</p></div>,
<div class="leader-info"><h4>Director of HR</h4><p>George SANTOZ</p></div>,
<div class="leader-info"><h4>Director of </h4><p>KUMBARO FURXHI Mirela</p></div>
'''
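# Name the parser explicitly to avoid the "no parser was explicitly specified" warning
soup = BeautifulSoup(html, 'html.parser')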

data = []

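for e in soup.select('.leader-info'):
    d = {
        'title': e.h4.text,
        # words that are not all upper case make up the given name
        'first_name': ' '.join([s for s in e.p.text.split() if not s.isupper()]),
        # all-caps words make up the surname; capitalize() turns 'LOPEZ' into 'Lopez'
        'last_name': ' '.join([s.capitalize() for s in e.p.text.split() if s.isupper()])
    }
    data.append(d)
    print('|'.join(d.values()))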

data
Output

Print:

Director of IR|Diane|Philips
Director of Finance|Nancy|Lopez
Director of HR|George|Santoz
Director of |Mirela|Kumbaro Furxhi

data:

[{'title': 'Director of IR', 'first_name': 'Diane', 'last_name': 'Philips'},
 {'title': 'Director of Finance', 'first_name': 'Nancy', 'last_name': 'Lopez'},
 {'title': 'Director of HR', 'first_name': 'George', 'last_name': 'Santoz'},
 {'title': 'Director of ',
  'first_name': 'Mirela',
  'last_name': 'Kumbaro Furxhi'}]
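
Once the rows are dictionaries like this, one possible way to get pipe-delimited output is the standard library's csv module. This is only a sketch: the two rows are copied from the data above, and io.StringIO stands in for a real file, which is an assumption made to keep the example self-contained:

```python
import csv
import io

# Two rows copied from the data list above, for illustration.
data = [
    {'title': 'Director of IR', 'first_name': 'Diane', 'last_name': 'Philips'},
    {'title': 'Director of ', 'first_name': 'Mirela', 'last_name': 'Kumbaro Furxhi'},
]

# In-memory buffer; for a real file use open('leaders.csv', 'w', newline='')
buffer = io.StringIO()
writer = csv.DictWriter(
    buffer,
    fieldnames=['title', 'first_name', 'last_name'],
    delimiter='|',
)
writer.writeheader()    # header row built from fieldnames
writer.writerows(data)  # one pipe-delimited line per dictionary

output = buffer.getvalue()
print(output)
```

The file name leaders.csv in the comment is hypothetical.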
Answered By: HedgeHog