Beginner – trying to scrape links and export to Excel with Python and BS4
Question:
I am trying to scrape a demo site from Webscraper.io that lists laptops. For each laptop I want the title, the price and the link. I am finding it difficult to work out how to scrape all of this information and export it to Excel. In particular, how do I add the link to the data I already have?
Here is what I have done so far:
import requests
from bs4 import BeautifulSoup
from pprint import pprint
url = "https://webscraper.io/test-sites/e-commerce/allinone/computers/laptops"
r = requests.get(url)
html = r.text
# Name the parser explicitly so BeautifulSoup does not have to guess one
soup = BeautifulSoup(html, "html.parser")
css_selector = {"class": "col-sm-4 col-lg-4 col-md-4"}
laptops = soup.find_all("div", attrs=css_selector)
for laptop in laptops:
    text = laptop.get_text()
    print(text)
But I still need a way to get the link for each laptop as well, and a way to export the scraped data to Excel.
I have tried to export the current data to Excel:
import pandas as pd
df = pd.DataFrame(laptop)
df.to_excel("laptop_.xlsx", encoding="utf-8")
But I’m just getting an Excel file that looks like this:
Answers:
Try printing the laptop variable. You will see that the output is the same information that ends up in the Excel file:
<div class="col-sm-4 col-lg-4 col-md-4">
  <div class="thumbnail">
    <img alt="item" class="img-responsive" src="/images/test-sites/e-commerce/items/cart2.png"/>
    <div class="caption">
      <h4 class="pull-right price">$1799.00</h4>
      <h4>
        <a class="title" href="/test-sites/e-commerce/allinone/product/544" title="Asus ROG Strix SCAR Edition GL503VM-ED115T">Asus ROG Strix S...</a>
      </h4>
      <p class="description">Asus ROG Strix SCAR Edition GL503VM-ED115T, 15.6" FHD 120Hz, Core i7-7700HQ, 16GB, 256GB SSD + 1TB SSHD, GeForce GTX 1060 6GB, Windows 10 Home</p>
    </div>
    <div class="ratings">
      <p class="pull-right">8 reviews</p>
      <p data-rating="3">
        <span class="glyphicon glyphicon-star"></span>
        <span class="glyphicon glyphicon-star"></span>
        <span class="glyphicon glyphicon-star"></span>
      </p>
    </div>
  </div>
</div>
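The reason a single block of markup ends up in the file is most likely that, once the for loop has finished, laptop still refers only to the last tag it visited, so pandas is handed one tag rather than rows of values. The following is a minimal sketch of that loop-variable behaviour; the two-item HTML snippet and the names items and item are made up for illustration and are not taken from the real page.

```python
from bs4 import BeautifulSoup

# Made-up two-item snippet standing in for the real page.
html = '<div class="item">First</div><div class="item">Second</div>'
soup = BeautifulSoup(html, "html.parser")
items = soup.find_all("div", class_="item")

for item in items:
    pass  # the loop body does not matter for this demonstration

# After the loop, `item` is only the last Tag, not the whole collection.
print(item.get_text())  # Second
print(len(items))       # 2
```

To export every laptop, the values have to be collected inside the loop instead of reusing the loop variable afterwards.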
The part you say you want to extract is the link, which is found here:
<a class="title" href="/test-sites/e-commerce/allinone/product/544" title="Asus ROG Strix SCAR Edition GL503VM-ED115T">Asus ROG Strix S...</a>
One way to get the link is to find this tag inside the div tag it is located in:
for laptop in laptops:
    laptop_link = laptop.find('a')  # Find the title link
    text = laptop_link.get_text()  # Text inside the tag (the shortened title)
    print(text)
Then, to get the hyperlink itself as opposed to the text inside, you need to get the tag’s href attribute, like this:
for laptop in laptops:
    laptop_link = laptop.find('a')  # Find the title link
    href = laptop_link['href']  # Get the link attribute
    print(href)
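Putting the pieces together, one possible way to get the title, price and link into a table is to build one dictionary per laptop and hand the list to pandas. This is a sketch under assumptions, not a definitive solution: the names parse_laptops, SAMPLE_HTML and BASE_URL are invented here, the selectors rely on the markup shown above, and the sample string is a trimmed copy of that one laptop so the example runs without network access.

```python
from urllib.parse import urljoin

import pandas as pd
from bs4 import BeautifulSoup

BASE_URL = "https://webscraper.io"  # assumed base for the relative hrefs

# Trimmed copy of the single laptop card shown above.
SAMPLE_HTML = """
<div class="col-sm-4 col-lg-4 col-md-4">
  <div class="thumbnail">
    <div class="caption">
      <h4 class="pull-right price">$1799.00</h4>
      <h4>
        <a class="title" href="/test-sites/e-commerce/allinone/product/544"
           title="Asus ROG Strix SCAR Edition GL503VM-ED115T">Asus ROG Strix S...</a>
      </h4>
    </div>
  </div>
</div>
"""


def parse_laptops(html, base_url=BASE_URL):
    """Return one dict (title, price, link) per laptop card found in html."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for card in soup.find_all("div", class_="thumbnail"):
        link = card.find("a", class_="title")
        price = card.find("h4", class_="price")
        if link is None or price is None:
            continue  # skip cards that do not match the expected markup
        rows.append({
            # The title attribute holds the full name; the tag text is shortened.
            "title": link.get("title", link.get_text(strip=True)),
            "price": price.get_text(strip=True),
            # Turn the relative href into an absolute URL.
            "link": urljoin(base_url, link["href"]),
        })
    return rows


rows = parse_laptops(SAMPLE_HTML)
df = pd.DataFrame(rows, columns=["title", "price", "link"])
print(df)
```

For the real page, pass requests.get(url).text to parse_laptops and finish with df.to_excel("laptops.xlsx", index=False), which needs an Excel writer such as openpyxl installed. The encoding argument from the question should not be needed, and as far as I know recent pandas versions no longer accept it.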