Zillow web scraping using Selenium & BeautifulSoup
Question:
I need to scrape 3 pages of rental houses in California on Zillow and put all the data into a pandas DataFrame. For every listing I need to pull: address, city, number of bedrooms and bathrooms, house size, lot size, year built, rent price, and rent date.
My code:
from bs4 import BeautifulSoup
import requests
import time
import os
import random
import re
import sys

import numpy as np
import pandas as pd
import scipy as sc

!pip install selenium
!pip install webdriver-manager

from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait
from webdriver_manager.chrome import ChromeDriverManager

# Build the options first, then pass them in when the driver is created
options = Options()
options.add_argument('--disable-blink-features=AutomationControlled')
driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()), options=options)
req_headers = {
'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
'accept-encoding': 'gzip, deflate, br',
'accept-language': 'en-US,en;q=0.8',
'upgrade-insecure-requests': '1',
'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/101.0.4951 Safari/537.36'
}
response = requests.get("https://www.zillow.com/homes/for_rent/CA/house_type/",headers=req_headers)
print(response)
soup = BeautifulSoup(response.content, 'html.parser')
print(soup.prettify())
listing_urls = []
listings = soup.find_all("article", {"class": "list-card list-card-additional-attribution list-card_not-saved"})
for listing in listings:
    listing_url = listing.find("a")["href"]
    print(listing_url)
    listing_urls.append(listing_url)
This is where I got stuck; I get the following error:
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
~\AppData\Local\Temp/ipykernel_24224/2055957203.py in <module>
4
5 for listing in listings:
----> 6 listing_url = listing.find("a")["href"]
7
8 print(listing_url)
TypeError: 'NoneType' object is not subscriptable
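For reference on the error itself: `find()` returns `None` when no matching tag exists, and subscripting `None` with `["href"]` raises exactly this `TypeError`, so at least one matched `<article>` contains no `<a>` tag. A minimal guard for that case (the HTML here is invented for illustration, not Zillow's real markup):

```python
from bs4 import BeautifulSoup

html = """
<article class="list-card"><a href="/homedetails/123/">Card 1</a></article>
<article class="list-card">no anchor inside this card</article>
<article class="list-card"><a href="/homedetails/456/">Card 2</a></article>
"""
soup = BeautifulSoup(html, "html.parser")

listing_urls = []
for card in soup.find_all("article", class_="list-card"):
    link = card.find("a")
    if link is not None and link.get("href"):  # skip cards with no usable anchor
        listing_urls.append(link["href"])

print(listing_urls)  # ['/homedetails/123/', '/homedetails/456/']
```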
In addition, the code prints only 2 links for the whole page, even though every page has 40 listings of houses/apartments for rent.
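Getting only a couple of links usually means the remaining cards are rendered lazily by JavaScript as the page scrolls, so the static HTML that `requests` receives contains only the first few. One way to force the rest into the DOM is to use the Selenium driver created above and scroll until the page height stops growing. A sketch (the function name, pause, and scroll limit are my own choices, not from the original code):

```python
import time

def load_all_cards(driver, pause=1.0, max_scrolls=10):
    """Scroll to the bottom repeatedly so lazily rendered cards are added
    to the DOM, then return the fully loaded page source."""
    last_height = driver.execute_script("return document.body.scrollHeight")
    for _ in range(max_scrolls):
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(pause)  # give the new cards time to render
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:  # height stable: nothing more to load
            break
        last_height = new_height
    return driver.page_source
```

The returned `page_source` can then be handed to `BeautifulSoup` in place of `response.content`.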
Thank you ! 🙂
Answers:
Edit:
If you're looking for a good scraping method, you should read this post:
https://medium.com/@knappik.marco/python-web-scraping-how-to-scrape-the-api-of-a-real-estate-website-dc8136e56249
It helped me a lot
Good luck 🙂
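Whichever way the listings are collected (the API approach in the linked post, or parsed cards), assembling the pandas DataFrame at the end is straightforward: build one dict per listing and pass the list to `pd.DataFrame`. The records below are made-up placeholders, not real Zillow responses:

```python
import pandas as pd

# Hypothetical records shaped like the fields the question asks for;
# in practice each dict would be filled from a scraped listing.
results = [
    {"address": "1 Main St", "city": "Los Angeles", "beds": 3, "baths": 2,
     "house_sqft": 1500, "lot_sqft": 4000, "year_built": 1985, "rent": 3200},
    {"address": "2 Oak Ave", "city": "San Diego", "beds": 2, "baths": 1,
     "house_sqft": 900, "lot_sqft": 2500, "year_built": 1972, "rent": 2400},
]
df = pd.DataFrame(results)
print(df.shape)  # (2, 8)
```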