Beautiful Soup data extract

Question:

I have a local .html file from which I am extracting data. I parsed it with BeautifulSoup, but I don't know how to extract the date that is inside a div. The parsed markup looks like this:

<div class="_a6-p"><div><div><a href="https://www.instagram.com/chuckbasspics" target="_blank">chuckbasspics</a></div><div>Jan 7, 2013, 5:41 AM</div></div></div><div class="_3-94 _a6-o"></div></div><div class="pam _3-95 _2ph- _a6-g uiBoxWhite noborder"><div class="_a6-p"><div><div>

Any idea how to do it?

I already extracted the users and urls (href) with the following code:

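from bs4 import BeautifulSoup

# Read and parse the local file; the context manager closes it afterwards
with open('followers.html', "r") as fl_html:
    soup = BeautifulSoup(fl_html.read(), 'lxml')

usernames = soup.find_all('a', href=True)

# Collect the link text (username) and the href of every anchor
users = []
url_follower = []
for i in usernames:
    users.append(i.get_text(strip=True))
    url_follower.append(i['href'])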
Asked By: schradernm


Answers:

You can use the bs4 API or a CSS selector:

from bs4 import BeautifulSoup

html_doc = """<div class="_a6-p"><div><div><a href="https://www.instagram.com/chuckbasspics" target="_blank">chuckbasspics</a></div><div>Jan 7, 2013, 5:41 AM</div></div></div><div class="_3-94 _a6-o"></div></div><div class="pam _3-95 _2ph- _a6-g uiBoxWhite noborder"><div class="_a6-p"><div><div>"""

soup = BeautifulSoup(html_doc, "html.parser")

Extracting the date using .get_text() with separator=

You can get all the text from the HTML snippet with a custom separator, then .split on it:

t = soup.get_text(strip=True, separator="|").split("|")
print(t[1])

Prints:

Jan 7, 2013, 5:41 AM

CSS selector

Find the next sibling of the <div> that contains the <a>:

t = soup.select_one("div:has(a) + div")
print(t.text)

Prints:

Jan 7, 2013, 5:41 AM
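
The same sibling relationship can be used to pair every username with its date in one pass. The following is a minimal sketch, assuming each <a> sits in its own <div> with the date in the very next <div>, as in the snippet from the question. The second entry (example_user) is invented sample data for illustration only.

```python
from bs4 import BeautifulSoup

# First entry is from the question; the second is hypothetical sample data.
html_doc = """
<div class="_a6-p"><div><div><a href="https://www.instagram.com/chuckbasspics" target="_blank">chuckbasspics</a></div><div>Jan 7, 2013, 5:41 AM</div></div></div>
<div class="_a6-p"><div><div><a href="https://www.instagram.com/example_user" target="_blank">example_user</a></div><div>Feb 3, 2014, 9:15 PM</div></div></div>
"""

soup = BeautifulSoup(html_doc, "html.parser")

followers = []
for link in soup.find_all("a", href=True):
    # link.parent is the <div> wrapping the <a>; the date is its next <div> sibling
    date_div = link.parent.find_next_sibling("div")
    followers.append({
        "user": link.get_text(strip=True),
        "url": link["href"],
        "date": date_div.get_text(strip=True) if date_div else None,
    })

for f in followers:
    print(f["user"], f["url"], f["date"])
```

Starting from each link keeps the username, URL and date of one follower together, and an entry without a date ends up with None instead of shifting the other values.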

Using bs4 API

The time must contain AM or PM, so select the <div> whose text contains one of these strings:

# string= is the current name of the older text= argument
t = soup.find("div", string=lambda t: t and (" AM" in t or " PM" in t))
print(t.text)

Prints:

Jan 7, 2013, 5:41 AM
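
Once the date text is extracted, it may be useful to turn it into a datetime object for sorting or filtering. This is a minimal sketch using only the standard library, assuming every date follows the "Jan 7, 2013, 5:41 AM" pattern shown above and that the default English/C locale is in effect; parse_follow_date is a hypothetical helper name.

```python
from datetime import datetime

# Format assumed from the sample: abbreviated month, day, year, 12-hour time, AM/PM
DATE_FORMAT = "%b %d, %Y, %I:%M %p"

def parse_follow_date(text):
    """Convert a string such as 'Jan 7, 2013, 5:41 AM' to a datetime."""
    return datetime.strptime(text.strip(), DATE_FORMAT)

parsed = parse_follow_date("Jan 7, 2013, 5:41 AM")
print(parsed.isoformat())  # 2013-01-07T05:41:00
```

strptime raises ValueError for text that does not match the format, so a try/except around the call may be needed if the file contains other date layouts.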
Answered By: Andrej Kesely