Scraping <a href> and title from some <div class="xxx">
Question:
I am doing web scraping and have done this so far:
import requests
from bs4 import BeautifulSoup

page = requests.get('http://abcdefgh.in')
print(page.status_code)
soup = BeautifulSoup(page.content, 'html.parser')
all_p = soup.find_all(class_="p-list-sec")
print(all_p)
After doing this, I get something like this when I print all_p:
<div class = "p-list-sec">
<UI> <li> < a href = "link1", title = "tltle1">title1<a/></li>
<li> < a href = "link2", title = "tltle2">title2<a/></li>
<li> < a href = "link3", title = "tltle3">title3<a/></li>
</ui>
</div>
<div class = "p-list-sec">
<UI> <li> < a href = "link1", title = "tltle1">title1<a/></li>
<li> < a href = "link2", title = "tltle2">title2<a/></li>
<li> < a href = "link3", title = "tltle3">title3<a/></li>
</ui>
</div>
<div class = "p-list-sec">
<UI> <li> < a href = "link1", title = "tltle1">title1<a/></li>
<li> < a href = "link2", title = "tltle2">title2<a/></li>
<li> < a href = "link3", title = "tltle3">title3<a/></li>
</ui>
</div>
and so on, for around 40 such divs.
Now I want to extract all the a href and title values inside every p-list-sec div and store them in a file. I know how to write them to a file; extracting all the href and title values from all the p-list-sec divs is what is causing the issue for me.
I am using Python 3.9 with the requests and beautifulsoup libraries on Windows 10, from the command prompt.
Thanks,
akhi
Answers:
In case you don't care about the div name, here is a one-liner:
import re

with open("data.html", "r") as msg:
    data = msg.readlines()

data = [tuple(re.sub(r'.+href = "(.+)",.+title = "(.+)".+', r'\1 \2', v).split()) for v in [v.strip() for v in data if "href" in v]]
Output:
[('link1', 'tltle1'), ('link2', 'tltle2'), ('link3', 'tltle3'), ('link1', 'tltle1'), ('link2', 'tltle2'), ('link3', 'tltle3'), ('link1', 'tltle1'), ('link2', 'tltle2'), ('link3', 'tltle3')]
Otherwise:
with open("data.html", "r") as msg:
data = msg.readlines()
div_write = False
href_write = False
wdata = []; odata = []
for line in data:
if '<div class =' in line:
class_name = line.split("<div class =")[1].split(">")[0].strip()
div_write = True
if "</div>" in line and div_write == True:
odata.append(wdata)
wdata = []
div_write = False
if div_write == True and "< a href" in line:
href = line.strip().split("< a href =")[1].split(",")[0].strip()
title = line.strip().split("title =")[1].split(">")[0].strip()
wdata.append(class_name+" "+href+" "+title)
with open("out.dat", "w") as msg:
for wdata in odata:
msg.write("n".join(wdata)+"nn")
This writes a file that keeps track of both the extracted information and the section (class) name.
Output:
"p-list-sec" "link1" "tltle1"
"p-list-sec" "link2" "tltle2"
"p-list-sec" "link3" "tltle3"
"p-list-sec" "link1" "tltle1"
"p-list-sec" "link2" "tltle2"
"p-list-sec" "link3" "tltle3"
"p-list-sec" "link1" "tltle1"
"p-list-sec" "link2" "tltle2"
"p-list-sec" "link3" "tltle3"
Would this work?
...
for p in all_p:
    for link in p.find_all('a'):
        print(link['href'])
        print(link.text)  # or link['title']
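Since the goal is to write the results to a file, here is a minimal sketch of the same loop writing to a text file (assuming all_p from the question; the file name links.txt is just an example):
with open('links.txt', 'w', encoding='utf-8') as f:
    for p in all_p:
        for link in p.find_all('a'):
            # .get() returns None instead of raising if the attribute is missing
            f.write('{}\t{}\n'.format(link.get('href'), link.get('title')))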
Just in case you want to avoid looping twice, you can also use BeautifulSoup's CSS selectors and chain the class and the <a> tag. So take your soup and select like this:
soup.select('.p-list-sec a')
To shape the information the way you want to process it, you can use a single for loop or a one-line list comprehension:
[{'url':link['href'], 'title':link['title']} for link in soup.select('.p-list-sec a')]
Output:
[{'url': 'link1', 'title': 'tltle1'},
{'url': 'link2', 'title': 'tltle2'},
{'url': 'link3', 'title': 'tltle3'},
{'url': 'link1', 'title': 'tltle1'},
{'url': 'link2', 'title': 'tltle2'},
{'url': 'link3', 'title': 'tltle3'},
{'url': 'link1', 'title': 'tltle1'},
{'url': 'link2', 'title': 'tltle2'},
{'url': 'link3', 'title': 'tltle3'}]
To store it in a CSV, feel free to push it into pandas or the csv module.
Pandas:
import pandas as pd
pd.DataFrame([{'url':link['href'], 'title':link['title']} for link in soup.select('.p-list-sec a')]).to_csv('url.csv', index=False)
CSV:
import csv

data_list = [{'url': link['href'], 'title': link['title']} for link in soup.select('.p-list-sec a')]
keys = data_list[0].keys()

# newline='' prevents blank rows in the CSV on Windows
with open('url.csv', 'w', newline='') as output_file:
    dict_writer = csv.DictWriter(output_file, keys)
    dict_writer.writeheader()
    dict_writer.writerows(data_list)
I was able to do this with the following:
for p in all_p:
    for b in p.findAll('a'):
        fullLink = str(b.get('href'))
        title = str(b.get('title'))
        href = 'link = {}, title = {}\n'.format(fullLink, title)
        print(href)
It works fine for me.
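To also store those lines in a file instead of only printing them, a minimal sketch along the same lines (assuming the same all_p; output.txt is just an example name):
lines = []
for p in all_p:
    for b in p.findAll('a'):
        # collect the formatted strings first, then write them all at once
        lines.append('link = {}, title = {}\n'.format(b.get('href'), b.get('title')))

with open('output.txt', 'w', encoding='utf-8') as f:
    f.writelines(lines)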
Thanks