Python: search a file for text that follows other matched text (look ahead)?

Question:

I have a massive HTML file, and I need to find a piece of text for every .jpg image in it. The process I want to perform is:

  • search for the image’s name referenced in an href.
  • if found, look ahead for the first instance of a regex.

Here is part of the file. There are many entries like this, and I need to grab the date from each one.

        <div class="_2ph_ _a6-p">
         <div>
          <div class="_2pin">
           <div>
            <div>
             <div>
              <div class="_a7nf">
               <div class="_a7ng">
                <div>
                 <a href="folder/image.jpg" target="_blank">
                  <img class="_a6_o _3-96" src="folder/image.jpg"/>
                 </a>
                 <div>
                  Mobile uploads
                 </div>
                 <div class="_3-95">
                  Test Test Test
                 </div>
                </div>
               </div>
              </div>
             </div>
            </div>
           </div>
          </div>
          <div class="_2pin">
           <div>
            Test Test Test
           </div>
          </div>
         </div>
        </div>
        <div class="_3-94 _a6-o">
         <a href="https://www.example.com;s=518" target="_blank">
          <div class="_a72d">
           Jun 25, 2011 12:10:54pm
          </div>
         </a>
        </div>
       </div>

        <div class="_2ph_ _a6-p">
         <div>
          <div class="_2pin">
           <div>
            <div>
             <div>
              <div class="_a7nf">
               <div class="_a7ng">
                <div>
                 <a href="folder/image2.jpg" target="_blank">
                  <img class="_a6_o _3-96" src="folder/image2.jpg"/>
                 </a>
                 <div>
                  Mobile uploads
                 </div>
                 <div class="_3-95">
                  Test Test Test Test
                 </div>
                </div>
               </div>
              </div>
             </div>
            </div>
           </div>
          </div>
          <div class="_2pin">
           <div>
            Test Test Test Test
           </div>
          </div>
         </div>
        </div>
        <div class="_3-94 _a6-o">
         <a href="https://www.example.com;s=518" target="_blank">
          <div class="_b28q">
           Feb 10, 2012 1:10:54am
          </div>
         </a>
        </div>
       </div>



        <div class="_3-95 _a6-g">
          <div class="_2pin">Testing </div>
        <div class="_3-95 _a6-p">
           <div>
            <div><div>
            <div><div>
            <div><div>
            <div><div>
             <div>
              <div>
                 <a href="folder/image3.jpg" target="_blank">
                  <img class="_a6_o _3-96" src="folder/image.jpg"/>
                 </a>
                 <div></div>
               </div>
              </div>
             </div>
            </div>
          <div class="_b28q">
           Feb 10, 2012 1:10:54am
          </div>
         </div>


I already figured out a regex that works for the date:

    rx_date = re.compile(r'(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)\s\d{1,2},\s\d{4}\s\d{1,2}:\d{2}:\d{2}(?:AM|PM|am|pm)')

I need to find Jun 25, 2011 12:10:54pm for the reference to image.jpg and Feb 10, 2012 1:10:54am for the reference to image2.jpg. How can I accomplish that?

I experimented with Beautiful Soup, but all I managed to do with it was gather parts of the file; I could not figure out how to tell Beautiful Soup to look ahead. I tried chaining .parent.parent.parent.parent.parent.parent.parent.parent.child, but that didn’t work. Note that every div class name is random, so I can’t use the class names as a reference.

EDIT:
I added one little monkey wrench in the logic. Sometimes the date is not in an a tag but in a div by itself. The HTML example has been updated.

Asked By: Dave

||

Answers:

You can use the bs4 API: select each <img> that is a direct child of an <a> tag, then, for the date, find the next <a> tag that contains a <div>:

    from bs4 import BeautifulSoup

    soup = BeautifulSoup(html_doc, "html.parser")  # html_doc contains the snippet from your question

    # Visit every <img> that is a direct child of an <a> tag.
    for img in soup.select("a > img"):
        src = img["src"]
        # The date sits in the next <a> tag that has a <div> child.
        date = img.find_next(lambda tag: tag.name == "a" and tag.div).text.strip()
        print(f"{src=} {date=}")

Prints:

    src='folder/image.jpg' date='Jun 25, 2011 12:10:54pm'
    src='folder/image2.jpg' date='Feb 10, 2012 1:10:54am'
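
The literal "look ahead" idea from the question can also be sketched without a parser, working on the raw text. This is only a minimal sketch: rx_href and dates_by_image are hypothetical names, the date pattern assumes the \s and \d escapes, and the sample string is a trimmed stand-in for the real file.

```python
import re

# Date pattern from the question, with the \s and \d escapes written out.
rx_date = re.compile(
    r"(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)"
    r"\s\d{1,2},\s\d{4}\s\d{1,2}:\d{2}:\d{2}(?:AM|PM|am|pm)"
)
# Hypothetical pattern for the image links: any href ending in .jpg.
rx_href = re.compile(r'href="([^"]+\.jpg)"')


def dates_by_image(html_text):
    """Map each .jpg href to the first date that appears after it."""
    result = {}
    for link in rx_href.finditer(html_text):
        # Compiled patterns accept a start position, so the search
        # begins right after the href instead of at the top of the file.
        found = rx_date.search(html_text, link.end())
        result[link.group(1)] = found.group(0) if found else None
    return result


sample = """
<a href="folder/image.jpg" target="_blank"><img src="folder/image.jpg"/></a>
<div class="_a72d"> Jun 25, 2011 12:10:54pm </div>
<a href="folder/image2.jpg" target="_blank"><img src="folder/image2.jpg"/></a>
<div class="_b28q"> Feb 10, 2012 1:10:54am </div>
"""
print(dates_by_image(sample))
```

One caveat with this approach: if an image has no date of its own, the search runs on and picks up the date that belongs to the next image.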

EDIT: With updated input:

    import re

    rx_date = re.compile(
        r"(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)\s\d{1,2},\s\d{4}\s\d{1,2}:\d{2}:\d{2}(?:AM|PM|am|pm)"
    )


    for img in soup.select("a > img"):
        src = img["src"]
        # Take the next <div> whose own direct text matches the date pattern,
        # whether or not it is wrapped in an <a> tag.
        date = img.find_next(
            lambda tag: tag.name == "div"
            and rx_date.search(tag.find(string=True, recursive=False) or "")
        ).text.strip()
        print(f"{src=} {date=}")

Prints:

    src='folder/image.jpg' date='Jun 25, 2011 12:10:54pm'
    src='folder/image2.jpg' date='Feb 10, 2012 1:10:54am'
    src='folder/image.jpg' date='Feb 10, 2012 1:10:54am'
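
If installing bs4 is not an option, the same pairing can be sketched with the standard library's html.parser. This is a rough sketch under stated assumptions: ImageDateCollector is a hypothetical helper, the sample string is a trimmed stand-in for the real file, and the whole text is fed to the parser in one call.

```python
import re
from html.parser import HTMLParser

rx_date = re.compile(
    r"(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)"
    r"\s\d{1,2},\s\d{4}\s\d{1,2}:\d{2}:\d{2}(?:AM|PM|am|pm)"
)


class ImageDateCollector(HTMLParser):
    """Hypothetical helper: pair each .jpg link with the next date seen."""

    def __init__(self):
        super().__init__()
        self.pending = []  # hrefs still waiting for a date
        self.pairs = []    # (href, date) tuples in document order

    def handle_starttag(self, tag, attrs):
        href = dict(attrs).get("href")
        if tag == "a" and href and href.lower().endswith(".jpg"):
            self.pending.append(href)

    def handle_data(self, data):
        # Only text nodes are inspected, so the enclosing tag and its
        # class name do not matter.
        found = rx_date.search(data)
        if found and self.pending:
            self.pairs.extend((href, found.group(0)) for href in self.pending)
            self.pending.clear()


sample = """
<a href="folder/image.jpg" target="_blank"><img src="folder/image.jpg"/></a>
<a href="https://www.example.com;s=518"><div class="_a72d">
 Jun 25, 2011 12:10:54pm
</div></a>
<a href="folder/image3.jpg" target="_blank"><img src="folder/image.jpg"/></a>
<div class="_b28q">
 Feb 10, 2012 1:10:54am
</div>
"""

collector = ImageDateCollector()
collector.feed(sample)
collector.close()
print(collector.pairs)
```

Because the collector keys on the href rather than the img src, the third entry is reported as folder/image3.jpg. Feeding the file in several pieces is possible, but a date split across two pieces may be missed, since a text node can then arrive in more than one chunk.
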
Answered By: Andrej Kesely