Scraping <a href> and title from some <div class="xxx">

Question:

I am doing web scraping and have done this so far-

import requests
from bs4 import BeautifulSoup

page = requests.get('http://abcdefgh.in')
print(page.status_code)
soup = BeautifulSoup(page.content, 'html.parser')
all_p = soup.find_all(class_="p-list-sec")
print(all_p)

After doing this, printing all_p gives me something like this:

<div class="p-list-sec">
<ul>
  <li><a href="link1" title="title1">title1</a></li>
  <li><a href="link2" title="title2">title2</a></li>
  <li><a href="link3" title="title3">title3</a></li>
</ul>
</div>

<div class="p-list-sec">
<ul>
  <li><a href="link1" title="title1">title1</a></li>
  <li><a href="link2" title="title2">title2</a></li>
  <li><a href="link3" title="title3">title3</a></li>
</ul>
</div>

<div class="p-list-sec">
<ul>
  <li><a href="link1" title="title1">title1</a></li>
  <li><a href="link2" title="title2">title2</a></li>
  <li><a href="link3" title="title3">title3</a></li>
</ul>
</div>

and so on, up to around 40 such divs.

Now I want to extract all the href and title attributes inside the p-list-sec class and store them in a file. I know how to store them in a file, but extracting all the href and title values from all of the p-list-sec divs is what is causing me trouble.
I am using Python 3.9 with the requests and beautifulsoup libraries, on Windows 10 via the command prompt.

Thanks,
akhi

Asked By: ABD


Answers:

If you don't care about the div's class name, here is a one-liner:

import re

with open("data.html", "r") as msg:
    data = msg.readlines()

data = [tuple(re.sub(r'.+href="(.+)" title="(.+)">.+', r'\1 \2', v).split())
        for v in [v.strip() for v in data if "href" in v]]

Output:

[('link1', 'title1'), ('link2', 'title2'), ('link3', 'title3'), ('link1', 'title1'), ('link2', 'title2'), ('link3', 'title3'), ('link1', 'title1'), ('link2', 'title2'), ('link3', 'title3')]
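
A slightly more compact sketch of the same idea, assuming each href/title pair sits on one line in href-then-title order: re.findall with two capture groups returns the list of tuples directly, with no need for sub and split.

import re

with open("data.html", "r") as msg:
    html = msg.read()

# Non-greedy groups capture each href/title pair across the whole document.
pairs = re.findall(r'href="(.+?)" title="(.+?)"', html)
print(pairs)  # [('link1', 'title1'), ('link2', 'title2'), ...]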

Otherwise:

with open("data.html", "r") as msg:
    data = msg.readlines()

div_write = False
href_write = False

wdata = []; odata = []

for line in data:
    if '<div class =' in line:
        class_name = line.split("<div class =")[1].split(">")[0].strip()
        div_write = True
    if "</div>" in line and div_write == True:
        odata.append(wdata)
        wdata = []
        div_write = False

    if div_write == True and "< a href" in line:
        href = line.strip().split("< a href =")[1].split(",")[0].strip()
        title = line.strip().split("title =")[1].split(">")[0].strip()
        wdata.append(class_name+" "+href+" "+title)

with open("out.dat", "w") as msg:
    for wdata in odata:
        msg.write("n".join(wdata)+"nn")

This writes a file that keeps track of both the link information and the section (class) name.

Output:

"p-list-sec" "link1" "tltle1"
"p-list-sec" "link2" "tltle2"
"p-list-sec" "link3" "tltle3"

"p-list-sec" "link1" "tltle1"
"p-list-sec" "link2" "tltle2"
"p-list-sec" "link3" "tltle3"

"p-list-sec" "link1" "tltle1"
"p-list-sec" "link2" "tltle2"
"p-list-sec" "link3" "tltle3"
Answered By: Synthase

Would this work?

...

for p in all_p:
    for link in p.find_all('a'):
        print(link['href'])
        print(link.text) # or link['title']
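
Since the goal is to store the results in a file, here is a minimal sketch of the same loop writing one line per link (the links.txt filename is just a placeholder):

with open('links.txt', 'w') as out:
    for p in all_p:
        for link in p.find_all('a'):
            # .get() returns None instead of raising KeyError if an attribute is missing
            out.write('{} {}\n'.format(link.get('href'), link.get('title')))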
Answered By: Yevhen Kuzmovych

Just in case

Just in case you want to avoid looping twice, you can also use a BeautifulSoup CSS selector, chaining the class and the <a> tag. So take your soup and select like this:

soup.select('.p-list-sec a')

To shape the information the way you want to process it, you can use a single for loop or a one-line list comprehension:

[{'url':link['href'], 'title':link['title']} for link in soup.select('.p-list-sec a')]

Output

[{'url': 'link1', 'title': 'title1'},
 {'url': 'link2', 'title': 'title2'},
 {'url': 'link3', 'title': 'title3'},
 {'url': 'link1', 'title': 'title1'},
 {'url': 'link2', 'title': 'title2'},
 {'url': 'link3', 'title': 'title3'},
 {'url': 'link1', 'title': 'title1'},
 {'url': 'link2', 'title': 'title2'},
 {'url': 'link3', 'title': 'title3'}]

To store it in a CSV, feel free to push it through pandas or the csv module.

Pandas:

import pandas as pd

pd.DataFrame([{'url':link['href'], 'title':link['title']} for link in soup.select('.p-list-sec a')]).to_csv('url.csv', index=False)
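
Note that link['title'] raises a KeyError if one of the roughly 40 divs contains an <a> without a title attribute; if the real page is less uniform than the sample, a more defensive sketch of the same comprehension uses Tag.get, which returns None for missing attributes:

pd.DataFrame([{'url': link.get('href'), 'title': link.get('title')} for link in soup.select('.p-list-sec a')]).to_csv('url.csv', index=False)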

CSV:

import csv
data_list = [{'url':link['href'], 'title':link['title']} for link in soup.select('.p-list-sec a')]

keys = data_list[0].keys()

with open('url.csv', 'w', newline='') as output_file:  # newline='' avoids blank rows on Windows
    dict_writer = csv.DictWriter(output_file, keys)
    dict_writer.writeheader()
    dict_writer.writerows(data_list)
Answered By: HedgeHog

I was able to do it with this:

for p in all_p:
    for b in p.find_all('a'):
        fullLink = str(b.get('href'))
        title = str(b.get('title'))
        href = 'link = {}, title = {}\n'.format(fullLink, title)
        print(href)

It works fine for me.
Thanks

Answered By: ABD