How to get text from HTML with a slightly odd structure?
Question:
I have a website with HTML structure like this inside it:
<div class="ui-rectframe">
<p class="ui-li-desc"></p>
<h4 class="ui-li-heading">Qualifications</h4>
MBBS (University of Singapore, Singapore) 1978
<br>
MCFP (Family Med) (College of Family Physicians, Singapore) 1984
<br>
Dip Geriatric Med (NUS, Singapore) 2012
<br>
GDPM (NUS, Singapore) 2015
<br>
<h4 class="ui-li-heading">Type of first registration / date</h4>
Full Registration (14/06/1979)<br>
<h4 class="ui-li-heading">Type of current registration / date</h4>
Full Registration (14/06/1979)<br>
<h4 class="ui-li-heading">Practising Certificate Start Date</h4>
01/01/2022<br>
<h4 class="ui-li-heading">Practising Certificate End Date</h4>
31/12/2023<br>
<p></p><br>
</div>
I need to extract the qualifications: [ 'MBBS (University of Singapore, Singapore) 1978', 'MCFP (Family Med) (College of Family Physicians, Singapore) 1984', 'Dip Geriatric Med (NUS, Singapore) 2012', 'GDPM (NUS, Singapore) 2015' ]
How can I do that using a CSS selector or XPath? I am able to extract all text items inside that parent div, but I can’t separate the qualifications from other values like Type of first registration, etc.
Answers:
This is a bit hacky but gets you the expected result (for this particular HTML example).
Try:
import re
from bs4 import BeautifulSoup
sample = '''<div class="ui-rectframe">
<p class="ui-li-desc"></p>
<h4 class="ui-li-heading">Qualifications</h4>
MBBS (University of Singapore, Singapore) 1978
<br>
MCFP (Family Med) (College of Family Physicians, Singapore) 1984
<br>
Dip Geriatric Med (NUS, Singapore) 2012
<br>
GDPM (NUS, Singapore) 2015
<br>
<h4 class="ui-li-heading">Type of first registration / date</h4>
Full Registration (14/06/1979)<br>
<h4 class="ui-li-heading">Type of current registration / date</h4>
Full Registration (14/06/1979)<br>
<h4 class="ui-li-heading">Practising Certificate Start Date</h4>
01/01/2022<br>
<h4 class="ui-li-heading">Practising Certificate End Date</h4>
31/12/2023<br>
<p></p><br>
</div>
'''
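# Flatten the markup to plain text; each qualification sits on its own line.
text = BeautifulSoup(sample, 'html.parser').text
# Keep lines containing 1-4 uppercase letters followed later by ") <digits>".
output = [
    x.strip() for x in text.splitlines()
    if re.search(r'([A-Z]{1,4}.*\)\s\d+)', x)
]
print(output)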
Output:
['MBBS (University of Singapore, Singapore) 1978', 'MCFP (Family Med) (College of Family Physicians, Singapore) 1984', 'Dip Geriatric Med (NUS, Singapore) 2012', 'GDPM (NUS, Singapore) 2015']
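To see why the filter keeps only the qualification lines, the regular expression can be tried on a few lines from the sample. This is only a small sketch: the `lines` list and the `looks_like_qualification` helper are made up for illustration. It assumes every qualification ends with a closing parenthesis, a space and a year, which holds for this page but may not hold for others.

```python
import re

# 1-4 uppercase letters, anything, then ")" + whitespace + digits (e.g. ") 1978").
PATTERN = re.compile(r'[A-Z]{1,4}.*\)\s\d+')

def looks_like_qualification(line):
    """Hypothetical helper: True if the line matches the pattern."""
    return PATTERN.search(line) is not None

# A few lines taken from the sample HTML's text.
lines = [
    'Qualifications',
    'MBBS (University of Singapore, Singapore) 1978',
    'Full Registration (14/06/1979)',
    '01/01/2022',
]

for line in lines:
    print(looks_like_qualification(line), line)
```

Only the MBBS line is reported as a match; the registration line fails because its closing parenthesis is not followed by a space and digits.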
You could extract a list of headers and a list of all stripped_strings, then use a function to separate them by checking against the headers:
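def create_dict(strings, headers):
    idx = 0
    d = {}
    for header in headers:
        sublist = []
        # Collect everything up to (but not including) the next header.
        while strings[idx] != header:
            sublist.append(strings[idx])
            idx += 1
        # The first collected string is the section title, the rest are its values.
        if sublist:
            d[sublist[0]] = sublist[1:]
    return d

# Header texts, in document order.
h = [e.get_text(strip=True) for e in soup.select('div h4')]
# Every non-empty string inside the div, headers included.
s = list(soup.div.stripped_strings)
create_dict(s, h)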
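Output:
Note: this stores the results in a dict, so values from the other sections can be picked up as well if necessary. The last section (Practising Certificate End Date) is not included, because a section is only stored once the following header is reached: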
{'Qualifications': ['MBBS (University of Singapore, Singapore) 1978',
                    'MCFP (Family Med) (College of Family Physicians, Singapore) 1984',
                    'Dip Geriatric Med (NUS, Singapore) 2012',
                    'GDPM (NUS, Singapore) 2015'],
 'Type of first registration / date': ['Full Registration (14/06/1979)'],
 'Type of current registration / date': ['Full Registration (14/06/1979)'],
 'Practising Certificate Start Date': ['01/01/2022']}
Example
from bs4 import BeautifulSoup
html = '''
<div class="ui-rectframe">
<p class="ui-li-desc"></p>
<h4 class="ui-li-heading">Qualifications</h4>
MBBS (University of Singapore, Singapore) 1978
<br>
MCFP (Family Med) (College of Family Physicians, Singapore) 1984
<br>
Dip Geriatric Med (NUS, Singapore) 2012
<br>
GDPM (NUS, Singapore) 2015
<br>
<h4 class="ui-li-heading">Type of first registration / date</h4>
Full Registration (14/06/1979)<br>
<h4 class="ui-li-heading">Type of current registration / date</h4>
Full Registration (14/06/1979)<br>
<h4 class="ui-li-heading">Practising Certificate Start Date</h4>
01/01/2022<br>
<h4 class="ui-li-heading">Practising Certificate End Date</h4>
31/12/2023<br>
<p></p><br>
</div>
'''
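# Name the parser explicitly so the result does not depend on what is installed.
soup = BeautifulSoup(html, 'html.parser')

def create_dict(strings, headers):
    idx = 0
    d = {}
    for header in headers:
        sublist = []
        # Collect everything up to (but not including) the next header.
        while strings[idx] != header:
            sublist.append(strings[idx])
            idx += 1
        # The first collected string is the section title, the rest are its values.
        if sublist:
            d[sublist[0]] = sublist[1:]
    return d

h = [e.get_text(strip=True) for e in soup.select('div h4')]
s = list(soup.div.stripped_strings)
print(create_dict(s, h))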
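As the output above shows, create_dict never stores the final section, because a section is only written once the next header is found. If the last section is needed too, one option is to walk the strings once and start a new list whenever a header is seen. The sketch below is a variant under that assumption, not the answerer's code: `group_by_header` is a hypothetical name, and the two lists are shortened and typed in by hand so the snippet runs without BeautifulSoup.

```python
def group_by_header(strings, headers):
    """Hypothetical variant of create_dict that also keeps the last section."""
    header_set = set(headers)
    grouped = {}
    current = None
    for s in strings:
        if s in header_set:
            # A header starts a new section.
            current = s
            grouped[current] = []
        elif current is not None:
            # Any other string belongs to the most recent header.
            grouped[current].append(s)
    return grouped

# Shortened, hand-written stand-ins for the lists h and s built above.
headers = ['Qualifications', 'Practising Certificate End Date']
strings = [
    'Qualifications',
    'MBBS (University of Singapore, Singapore) 1978',
    'GDPM (NUS, Singapore) 2015',
    'Practising Certificate End Date',
    '31/12/2023',
]

print(group_by_header(strings, headers))
```

One caveat: a value whose text is identical to a header would be treated as a new header, so this only fits pages where values and headers cannot collide.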
This is another way of achieving the same:
from bs4 import BeautifulSoup
html = """
<div class="ui-rectframe">
<p class="ui-li-desc"></p>
<h4 class="ui-li-heading">Qualifications</h4>
MBBS (University of Singapore, Singapore) 1978
<br>
MCFP (Family Med) (College of Family Physicians, Singapore) 1984
<br>
Dip Geriatric Med (NUS, Singapore) 2012
<br>
GDPM (NUS, Singapore) 2015
<br>
<h4 class="ui-li-heading">Type of first registration / date</h4>
Full Registration (14/06/1979)<br>
<h4 class="ui-li-heading">Type of current registration / date</h4>
Full Registration (14/06/1979)<br>
<h4 class="ui-li-heading">Practising Certificate Start Date</h4>
01/01/2022<br>
<h4 class="ui-li-heading">Practising Certificate End Date</h4>
31/12/2023<br>
<p></p><br>
</div>
"""
soup = BeautifulSoup(html, "html.parser")
data_dict = {}
for item in soup.select("h4.ui-li-heading"):
    header = item.get_text(strip=True)
    content = []
    # Walk the nodes after this heading until the next heading starts.
    for i in item.next_siblings:
        if i.name == "h4":
            break
        content.extend(i.stripped_strings)
    data_dict[header] = content
print(data_dict)
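The idea of attaching each text node to the heading that precedes it does not depend on BeautifulSoup. As a rough sketch, assuming only Python's standard library is available, the same grouping can be done with `html.parser.HTMLParser`. The `SectionParser` class and the shortened `html` string are made up for illustration and are not part of any answer above.

```python
from html.parser import HTMLParser

class SectionParser(HTMLParser):
    """Hypothetical helper: groups text nodes under the preceding <h4>."""

    def __init__(self):
        super().__init__()
        self.sections = {}        # heading text -> list of values
        self.in_heading = False   # True while inside an <h4>
        self.heading_parts = []   # text pieces of the current <h4>
        self.current = None       # most recently closed heading

    def handle_starttag(self, tag, attrs):
        if tag == 'h4':
            self.in_heading = True
            self.heading_parts = []

    def handle_endtag(self, tag):
        if tag == 'h4':
            self.in_heading = False
            self.current = ''.join(self.heading_parts).strip()
            self.sections[self.current] = []

    def handle_data(self, data):
        if self.in_heading:
            self.heading_parts.append(data)
            return
        text = data.strip()
        # Ignore whitespace-only nodes and text before the first heading.
        if text and self.current is not None:
            self.sections[self.current].append(text)

# Shortened version of the sample markup.
html = '''<div class="ui-rectframe">
<h4 class="ui-li-heading">Qualifications</h4>
MBBS (University of Singapore, Singapore) 1978
<br>
GDPM (NUS, Singapore) 2015
<br>
<h4 class="ui-li-heading">Practising Certificate End Date</h4>
31/12/2023<br>
</div>'''

parser = SectionParser()
parser.feed(html)
parser.close()
print(parser.sections['Qualifications'])
```

This trades the convenience of selectors for zero dependencies; for anything beyond a flat list of headings and text, the BeautifulSoup versions above are likely easier to maintain.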