Python: search a file for text based on other text found (look ahead)?
Question:
I have a massive HTML file, and for every .jpg image in it I need to find an associated piece of text. The process I want to perform is:
- search for the image's name referenced in an href.
- if found, look ahead for the first instance of a regex.
Here is a part of the file. There are many entries like this. I need to grab the date in each one.
<div class="_2ph_ _a6-p">
<div>
<div class="_2pin">
<div>
<div>
<div>
<div class="_a7nf">
<div class="_a7ng">
<div>
<a href="folder/image.jpg" target="_blank">
<img class="_a6_o _3-96" src="folder/image.jpg"/>
</a>
<div>
Mobile uploads
</div>
<div class="_3-95">
Test Test Test
</div>
</div>
</div>
</div>
</div>
</div>
</div>
</div>
<div class="_2pin">
<div>
Test Test Test
</div>
</div>
</div>
</div>
<div class="_3-94 _a6-o">
<a href="https://www.example.com;s=518" target="_blank">
<div class="_a72d">
Jun 25, 2011 12:10:54pm
</div>
</a>
</div>
</div>
<div class="_2ph_ _a6-p">
<div>
<div class="_2pin">
<div>
<div>
<div>
<div class="_a7nf">
<div class="_a7ng">
<div>
<a href="folder/image2.jpg" target="_blank">
<img class="_a6_o _3-96" src="folder/image2.jpg"/>
</a>
<div>
Mobile uploads
</div>
<div class="_3-95">
Test Test Test Test
</div>
</div>
</div>
</div>
</div>
</div>
</div>
</div>
<div class="_2pin">
<div>
Test Test Test Test
</div>
</div>
</div>
</div>
<div class="_3-94 _a6-o">
<a href="https://www.example.com;s=518" target="_blank">
<div class="_b28q">
Feb 10, 2012 1:10:54am
</div>
</a>
</div>
</div>
<div class="_3-95 _a6-g">
<div class="_2pin">Testing </div>
<div class="_3-95 _a6-p">
<div>
<div><div>
<div><div>
<div><div>
<div><div>
<div>
<div>
<a href="folder/image3.jpg" target="_blank">
<img class="_a6_o _3-96" src="folder/image.jpg"/>
</a>
<div></div>
</div>
</div>
</div>
</div>
<div class="_b28q">
Feb 10, 2012 1:10:54am
</div>
</div>
I already figured out a regex that works for the date:
rx_date = re.compile(r'(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)\s\d{1,2},\s\d{4}\s\d{1,2}:\d{2}:\d{2}(?:AM|PM|am|pm)')
I need to find Jun 25, 2011 12:10:54pm for the reference of image.jpg and Feb 10, 2012 1:10:54am for the reference of image2.jpg. How can I accomplish that?
I experimented with Beautiful Soup, but all I could do with it was gather parts of the file. I could not figure out how to tell Beautiful Soup to look ahead. I tried using .parent.parent.parent.parent.parent.parent.parent.parent.child but that didn't work. Note that every div class name is random, so I cannot use the class names as a reference.
EDIT:
I added one little monkey wrench to the logic. Sometimes the date is not inside an <a> tag but in a <div> by itself. The HTML example has been updated.
Answers:
Maybe you can use the bs4 API. Find each <a> tag that contains an <img>, and for the date find the next <a> tag that contains a <div>:
from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc, "html.parser") # html_doc contains the snippet from your question
for img in soup.select("a > img"):
    src = img["src"]
    date = img.find_next(lambda tag: tag.name == "a" and tag.div).text.strip()
    print(f"{src=} {date=}")
Prints:
src='folder/image.jpg' date='Jun 25, 2011 12:10:54pm'
src='folder/image2.jpg' date='Feb 10, 2012 1:10:54am'
EDIT: With updated input:
import re
rx_date = re.compile(
    r"(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)\s\d{1,2},\s\d{4}\s\d{1,2}:\d{2}:\d{2}(?:AM|PM|am|pm)"
)
for img in soup.select("a > img"):
    src = img["src"]
    # first following <div> whose own (direct) text matches the date pattern;
    # string= is the current name of the older text= argument
    date = img.find_next(
        lambda tag: tag.name == "div"
        and rx_date.search(tag.find(string=True, recursive=False) or "")
    ).text.strip()
    print(f"{src=} {date=}")
Prints:
src='folder/image.jpg' date='Jun 25, 2011 12:10:54pm'
src='folder/image2.jpg' date='Feb 10, 2012 1:10:54am'
src='folder/image.jpg' date='Feb 10, 2012 1:10:54am'
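If you would rather follow the "look ahead" idea from the question literally, a regex-only pass over the raw text might also work. This is only a sketch under a few assumptions: the whole file fits in memory as one string, image links always look like href="….jpg", and a date belongs to the closest preceding image link. The names RX_HREF, dates_after_images and sample are made up for illustration. The search for each date is bounded by the next image link, so an image without a date gets None instead of picking up its neighbour's date.

```python
import re

# Hypothetical pattern for image links: href="<anything>.jpg"
RX_HREF = re.compile(r'href="([^"]+\.jpg)"')
# Date pattern from the question
RX_DATE = re.compile(
    r"(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)"
    r"\s\d{1,2},\s\d{4}\s\d{1,2}:\d{2}:\d{2}(?:AM|PM|am|pm)"
)


def dates_after_images(html):
    """Map each .jpg href to the first date found before the next .jpg href."""
    hrefs = list(RX_HREF.finditer(html))
    result = {}
    for i, match in enumerate(hrefs):
        # Look ahead only up to the next image link (or the end of the text).
        stop = hrefs[i + 1].start() if i + 1 < len(hrefs) else len(html)
        found = RX_DATE.search(html, match.end(), stop)
        result[match.group(1)] = found.group(0) if found else None
    return result


# Shortened stand-in for the HTML in the question
sample = """
<a href="folder/image.jpg" target="_blank"><img src="folder/image.jpg"/></a>
<a href="https://www.example.com;s=518"><div class="_a72d">Jun 25, 2011 12:10:54pm</div></a>
<a href="folder/image2.jpg" target="_blank"><img src="folder/image2.jpg"/></a>
<div class="_b28q">Feb 10, 2012 1:10:54am</div>
<a href="folder/image3.jpg" target="_blank"><img src="folder/image3.jpg"/></a>
"""

for href, date in dates_after_images(sample).items():
    print(f"{href=} {date=}")
```

This keys the result on the href rather than the img src, and it does not care whether the date sits inside an <a> or a bare <div>. The trade-off is that it ignores the HTML structure entirely, so a date-like string in a caption would be picked up as well.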