Use BeautifulSoup to get previous tag data
Question:
I’m trying to create a list of dates and a respective link associated with those dates.
Currently, I have code that looks like this:
date_list = [tag["datetime"] for tag in new_soup.findAll(attrs={"datetime" : True})]
This will get me all of the values associated with "datetime" in the new_soup html.
Now, for every date that I add to this list, I would also like to add the link associated with it, which is in the previous tag:
HTML example:
<a class="Link--secondary ml-2"
data-pjax="#repo-content-pjax-container"
data-turbo-frame="repo-content-turbo-frame"
href="the link right here">
<relative-time class="no-wrap"
datetime="2023-03-07T02:38:29Z"
title="Mar 6, 2023, 7:38 PM MST">Mar 6, 2023
</relative-time>
Answers:
You can try using tag.find_previous():
from bs4 import BeautifulSoup
html_doc = '''
<a class="Link--secondary ml-2"
data-pjax="#repo-content-pjax-container"
data-turbo-frame="repo-content-turbo-frame"
href="the link right here">
</a>
<relative-time class="no-wrap"
datetime="2023-03-07T02:38:29Z"
title="Mar 6, 2023, 7:38 PM MST">Mar 6, 2023
</relative-time>'''
soup = BeautifulSoup(html_doc, 'html.parser')
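# find_all is the current name for findAll; find_previous('a') returns the nearest <a> that appears before the tag
date_list = [(tag["datetime"], tag.find_previous('a')['href']) for tag in soup.find_all(attrs={"datetime": True})]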
print(date_list)
Prints:
[('2023-03-07T02:38:29Z', 'the link right here')]
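If some dates might have no link before them, find_previous('a') returns None and indexing it with ['href'] raises a TypeError. The following is a minimal sketch of a more defensive variant, not part of the original answer: the helper name dates_with_links and the sample HTML are made up for illustration, and it assumes that None is an acceptable placeholder for a missing link.

```python
from bs4 import BeautifulSoup

# Hypothetical sample: the first date has no <a> before it at all.
html_doc = '''
<relative-time datetime="2023-03-01T00:00:00Z">Mar 1, 2023</relative-time>
<a href="/second/link"></a>
<relative-time datetime="2023-03-07T02:38:29Z">Mar 6, 2023</relative-time>
'''

soup = BeautifulSoup(html_doc, 'html.parser')


def dates_with_links(soup):
    """Pair each datetime value with the href of the nearest preceding <a>."""
    pairs = []
    for tag in soup.find_all(attrs={"datetime": True}):
        # href=True only matches <a> tags that actually have an href attribute
        link = tag.find_previous('a', href=True)
        # Fall back to None when no preceding link exists
        pairs.append((tag["datetime"], link["href"] if link else None))
    return pairs


print(dates_with_links(soup))
```

Prints:
[('2023-03-01T00:00:00Z', None), ('2023-03-07T02:38:29Z', '/second/link')]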