Web scraping Google Flights prices

Question:

I am trying to learn to use the Python library BeautifulSoup; as an exercise, I would like to scrape the price of a flight on Google Flights.
So I opened Google Flights, for example at this link, and I want to get the cheapest flight price.

I want to get the value inside the div with the class "gws-flights-results__itinerary-price" (as in the figure).

[figure: screenshot of the Google Flights results highlighting the itinerary price element]

Here is the simple code I wrote:

from bs4 import BeautifulSoup
import urllib.request

url = 'https://www.google.com/flights?hl=it#flt=/m/07_pf./m/05qtj.2019-04-27;c:EUR;e:1;sd:1;t:f;tt:o'
page = urllib.request.urlopen(url)
soup = BeautifulSoup(page, 'html.parser')
div = soup.find('div', attrs={'class': 'gws-flights-results__itinerary-price'})

But the result of find is None (NoneType), not the div I expected.

I also tried with

find_all('div')

but the div I was interested in was not among the divs returned this way.
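
A quick way to check whether the class name appears anywhere in the raw response at all, not just whether it parses as a div:

import urllib.request

url = 'https://www.google.com/flights?hl=it#flt=/m/07_pf./m/05qtj.2019-04-27;c:EUR;e:1;sd:1;t:f;tt:o'
raw_html = urllib.request.urlopen(url).read().decode('utf-8', errors='ignore')
# False here would confirm the price element is added later by JavaScript in the browser
print('gws-flights-results__itinerary-price' in raw_html)
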
Can someone help me?

Asked By: Andrea Barnabò


Answers:

It looks like JavaScript needs to run on the page, so use a tool like Selenium:

from selenium import webdriver
url = 'https://www.google.com/flights?hl=it#flt=/m/07_pf./m/05qtj.2019-04-27;c:EUR;e:1;sd:1;t:f;tt:o'
driver = webdriver.Chrome()  # requires chromedriver to be installed
driver.get(url)              # the browser runs the page's JavaScript
# read the rendered cheapest-price element
print(driver.find_element_by_css_selector('.gws-flights-results__cheapest-price').text)
driver.quit()
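Note: Selenium 4.3+ removed the find_element_by_* helper methods, so with a current install the equivalent (using the same selector, which may itself have changed since this was written) would be roughly:

from selenium import webdriver
from selenium.webdriver.common.by import By

url = 'https://www.google.com/flights?hl=it#flt=/m/07_pf./m/05qtj.2019-04-27;c:EUR;e:1;sd:1;t:f;tt:o'
driver = webdriver.Chrome()
driver.get(url)
# locate the cheapest-price element with the modern By-based API
print(driver.find_element(By.CSS_SELECTOR, '.gws-flights-results__cheapest-price').text)
driver.quit()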
Answered By: QHarr

BeautifulSoup is a great tool for extracting parts of HTML or XML, but here it looks like you only need to find the URL of the underlying GET request that returns a JSON object, and request that directly.

(I am not at a computer now; I can update with an example tomorrow.)
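
A minimal sketch of that idea, assuming the JSON request has already been located in the browser's Network tab; the endpoint below is a hypothetical placeholder, not a real Google Flights URL:

import requests

# hypothetical placeholder: copy the real request URL from the browser's
# Network tab (filter by XHR/Fetch while the flight results load)
json_url = 'https://www.google.com/PLACEHOLDER-flights-json-endpoint'

response = requests.get(json_url, headers={'User-Agent': 'Mozilla/5.0'})
response.raise_for_status()   # fail loudly if the request is blocked
data = response.json()        # the payload is already structured, no HTML parsing needed
print(data)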

Answered By: Punnerud

It's great that you are learning web scraping! The reason you are getting None as a result is that the website you are scraping loads its content dynamically. When urllib or requests fetches the URL, the response contains only the page skeleton and JavaScript; the div with the class "gws-flights-results__itinerary-price" has not been rendered yet. So this website cannot be scraped with the approach you are using.

However, you can fetch the page with a tool such as Selenium or Splash to render the JavaScript first and then parse the resulting content, as in the sketch below.
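
A minimal sketch of that approach, rendering the page with Selenium and then handing the result to BeautifulSoup (the class name is the one from the question and may have changed since):

import time
from bs4 import BeautifulSoup
from selenium import webdriver

url = 'https://www.google.com/flights?hl=it#flt=/m/07_pf./m/05qtj.2019-04-27;c:EUR;e:1;sd:1;t:f;tt:o'

driver = webdriver.Chrome()   # requires chromedriver to be installed
driver.get(url)               # the browser executes the page's JavaScript
time.sleep(5)                 # crude wait for the results to render; WebDriverWait is more robust

soup = BeautifulSoup(driver.page_source, 'html.parser')
driver.quit()

price_div = soup.find('div', attrs={'class': 'gws-flights-results__itinerary-price'})
print(price_div.text if price_div else 'price element not found')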

Answered By: Hoossain Khan

BeautifulSoup alone will not work here because the data is rendered dynamically; you need Selenium as well.

To see and parse all the results, you can click the "show more" pagination button:

button = driver.find_element(By.CSS_SELECTOR, ".XsapA")
driver.execute_script("arguments[0].click();", button)

You can find CSS selectors on a page using the SelectorGadget Chrome extension.


import time, json  # lxml only needs to be installed for the 'lxml' parser used below
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from bs4 import BeautifulSoup

URL = "https://www.google.com/travel/flights/search?tfs=CBwQAhooagwIAxIIL20vMDdfcGYSCjIwMjMtMDQtMDZyDAgDEggvbS8wNXF0ahooagwIAxIIL20vMDVxdGoSCjIwMjMtMDQtMTByDAgDEggvbS8wN19wZnABggELCP___________wFAAUgBmAEB&hl=it"

service = Service(executable_path="chromedriver")
options = webdriver.ChromeOptions()
options.add_argument("--headless")
options.add_argument("--lang=en")
options.add_argument("user-agent=Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/109.0.0.0 Safari/537.36")
options.add_argument("--no-sandbox")
options.add_argument("--disable-dev-shm-usage")
driver = webdriver.Chrome(service=service, options=options)

driver.get(URL)
time.sleep(1)  # brief pause so the "show more" button is present (WebDriverWait would be more robust)
button = driver.find_element(By.CSS_SELECTOR, ".XsapA")    # "show more" pagination button
driver.execute_script("arguments[0].click();", button)     # click via JavaScript to avoid overlay issues
driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")  # scroll to load lazy results
time.sleep(5)  # give the remaining results time to render

# parse the fully rendered HTML, then release the browser
soup = BeautifulSoup(driver.page_source, 'lxml')
driver.quit()

data = []

for flight in soup.select(".yR1fYc"):                        # one result card per itinerary
    name = flight.select_one(".Ir0Voe .sSHqwe span").text    # airline name
    price = flight.select_one(".FpEdX span").text            # displayed price
    
    data.append({
        "name": name,
        "price": price
    })
print(json.dumps(data, indent=2, ensure_ascii=False))

Example output:

[
  {
    "name": "Vueling",
    "price": "707 USD"
  },
  {
    "name": "Vueling",
    "price": "707 USD"
  },
  {
    "name": "Vueling",
    "price": "707 USD"
  },
  other results ...
]

There is a "13 ways to scrape any public data from any website" blog post if you want to know more about web scraping.

Answered By: Denis Skopa